Recently,  some  research  interest  has  centered  around
interprocessor  connections  for  SIMD  type  parallel  machines. 
However,  we  still  lack  a  methodology  for  evaluating  various
networks.  In  this  paper,  we  first  present  some  new  results 
on  network  properties.  Then  we  show  how  to  exploit  various 
networks  in  ordinary  computations.  Finally  we  describe  how 
we  can  apply  the  theoretical  results  to  predict  the 
performance  of  some  network  in  a  real  program  environment, 
which  is  the  true  measure  of  network  effectiveness.

1.  INTRODUCTION


Recently, component speeds have been improving at a tremendous
rate. Yet there are certain physical limitations to component
speeds. Multiprocessing then seems to be the area that shows the
most promise for any further speedup of computations. The arrival
of cheap but powerful LSI microprocessors greatly increases the
attractiveness of multiprocessing systems. However, a big problem
arises in finding the best way to interconnect all the processors.
The questions that are yet to be answered are what kind of network
we should use, how we should compile or restructure computation
algorithms in order to use it, and how well it works on ordinary
Fortran programs.

Many interconnection schemes have been proposed or built in recent
years. Thurber [1] gives a survey of some of the more important
ones. However, each of the networks proposed or built has different
requirements to fulfill, and their implementations are based on
different theoretical backgrounds. Frequently, their capabilities
are incompletely known, and their control algorithms are poorly
understood. Hence it is very difficult to categorize or assess the
merits of each of these networks. Chapter 2 of this thesis
investigates the theoretical part of network capabilities. New
network properties are presented which help us to utilize certain
networks more efficiently. This chapter also finds some ways to
simplify the control algorithms for realizing some commonly used
permutations on certain interconnection networks. By simplifying
the control algorithms of certain networks, we reduce their
complexity and increase their feasibility.

The attractiveness of a connection network depends not only on its
ability to handle some permutations, but also on the efficiency
with which some common computations can be mapped into the
processing system where it is located. By mapping, we refer to the
entire spectrum of strategies including arithmetic operation
scheduling, memory storage scheme, and intermediate data routing.
It can be shown that a great part of any Fortran program can be
represented as either array operations or recurrence systems. So a
meaningful processing system will have to be highly efficient in
mapping array or recurrence operations into the system. However, it
is also desirable to recognize some commonly used computation
algorithms, like matrix multiplication and the Fast Fourier
Transform, and exploit the best execution sequences and mapping
strategies in some given processing systems. Chapter 3 deals with
the restructuring of certain computation algorithms and the
exploitation of certain processor systems.


The true measure of the effectiveness of a processor
interconnection scheme (or even a certain parallel computer
organization) lies in its performance in a real program
environment, not on an operation-by-operation basis, nor on a
computation-by-computation basis. A current joint project deals
with the simulation of parallel processing systems. Input programs
will first be parallelized by a program analyzer. The parallelized
program graph is then input into the Resource Request Generator
(RRG), which then compiles it into pseudo machine code for a
specific parallel processing system configuration. The RRG will use
most of the known capabilities and properties of interconnection
networks and parallel processing systems. Finally the pseudo
machine code is simulated on a Simulator which will produce a set
of performance measures. This allows us to evaluate various
parallel processing system designs upon their performances on real
programs. Strategies and algorithms for this work are presented in
Chapter 4. Chapter 4 will also include a discussion of some
preliminary results.

2.  NETWORK CAPABILITIES

2.1  Introduction

In general, processor(/memory) interconnection networks can be
divided into two classes. The first class has multiple stages of
switching elements. The second class has only a single stage of
switching elements and this stage may have to be recycled many
times to obtain certain permutations. Examples of the first class
are the Batcher network[2], the Benes network[3], the omega
network[4], the barrel shifter, and Feng's data manipulator[5].
Networks such as the Illiac IV connection[6], the Swanson
connection[7], the +1 shift network and the perfect shuffle
network[8] are good examples of the one stage networks. Although
single stage networks may be slower in performing general
permutations of data than the multistage networks, they are much
cheaper in comparison. If we can restructure and recompile some of
the commonly used computation algorithms into algorithms which
fully utilize the available connectivities, we can retain the
performance level while drastically decreasing the cost of the
processing system.


Another way of categorizing interconnection networks is based on
the shuffle connection. Golomb [9], Pease [10], and Stone [8]
define the perfect shuffle permutation of a vector of N indices as

$$P(i) = \begin{cases} 2i, & 0 \le i < N/2 \\ (2i+1) \bmod N, & N/2 \le i < N. \end{cases}$$

Let us extend this definition to the (x,y) shuffle by defining

$$S(i) = x\,(i \bmod y) + \lfloor i/y \rfloor, \qquad 0 \le i < N.$$

Note that the perfect shuffle is just a (2, N/2) shuffle.
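
The two definitions above can be checked numerically. The following is a minimal sketch, not part of the original development; the function names `xy_shuffle` and `perfect_shuffle` are chosen here for illustration, and N = xy is assumed.

```python
def xy_shuffle(i, x, y):
    """(x, y) shuffle of index i for a vector of N = x * y indices."""
    return x * (i % y) + i // y


def perfect_shuffle(i, n_indices):
    """Perfect shuffle of index i, following the two-case definition."""
    if i < n_indices // 2:
        return 2 * i
    return (2 * i + 1) % n_indices


N = 8
shuffled = [xy_shuffle(i, 2, N // 2) for i in range(N)]
print(shuffled)  # [0, 2, 4, 6, 1, 3, 5, 7]
```

For N = 8 the (2, N/2) shuffle and the perfect shuffle give the same mapping, which is the statement made in the note above.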


Interconnection networks such as the Batcher network, the Benes
network, the omega network, and the Banyan network [11] have stages
of switching elements interconnected with shuffle connections. On
the other hand, examples of the non-shuffle-based networks are the
crossbar switch, the Illiac IV network, the barrel shifter, the
Swanson connection, and Feng's data manipulator.

One of the multiple-stage shuffle-based networks which has been of
particular interest to us in recent years is the omega network.
This network cannot perform all connections of its inputs to
outputs, yet it is capable of producing most of the connections
required by numerical programs. Because of the incomplete
capabilities of this network, it is necessary to analyze the
network to determine exactly which connections it can produce.
Section 2.2 will discuss the control structures and some new
results on capabilities of the omega network. These results are
essential to some of the algorithm solving abilities discussed in
Chapter 3.

The relations of the Batcher and Benes networks to the shuffle
connections are discussed in Sections 2.3 and 2.4. Section 2.5
discusses various routing techniques for permutations using
one-stage perfect shuffle networks.


2.2  Omega Network


2.2.1  Control Structures for Omega Network


2.2.1.1  Source/Destination Tag Method


The omega network is attractive not only because of its low gate
complexity, but also because of its control simplicity. There are a
number of ways to control the omega network. The most fundamental
method is the destination tag method[4], which uses N destination
tags, each of $\log_2 N$ bits. Each source port has a tag which
represents the destination port number the data element intends to
reach. $\log_2 N$ stages will be required to set the network, and
as each stage is set, the data elements will be switched
accordingly. This control method can be used to pass all
omega-passable permutations.
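
The destination tag method can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions rather than the control logic of [4]: it assumes that each stage is a perfect shuffle followed by a column of 2x2 switches, and that the switch output taken at stage k is selected by the kth most significant tag bit. The name `omega_route` is hypothetical.

```python
def omega_route(dest):
    """Route source s to dest[s] through an N x N omega network.

    Returns the final position of every source, or None if two data
    elements need the same switch output at some stage (a conflict).
    Assumes N = len(dest) is a power of two with N >= 2.
    """
    N = len(dest)
    n = N.bit_length() - 1
    assert 1 << n == N and n >= 1
    pos = list(range(N))
    for stage in range(n):
        new_pos = []
        for s in range(N):
            # Perfect shuffle: rotate the n-bit position left by one.
            shuffled = ((pos[s] << 1) | (pos[s] >> (n - 1))) & (N - 1)
            # The tag bit for this stage selects the switch output.
            tag_bit = (dest[s] >> (n - 1 - stage)) & 1
            new_pos.append((shuffled & ~1) | tag_bit)
        if len(set(new_pos)) < N:
            return None
        pos = new_pos
    return pos


one_shift = [(s + 1) % 4 for s in range(4)]
print(omega_route(one_shift))
print(omega_route([0, 2, 1, 3]))
```

Under these assumptions the end-around 1-shift is routed without conflict, while the permutation (0, 2, 1, 3) on a 4x4 network is rejected because two elements contend for the same output of a first-stage switch.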

A more general method is the source tagging method [4]. Instead of
using the destination tags, source tags are used. One source tag is
associated with each output and represents the input to which that
output port will be connected. The source tagging method can pass
all omega-passable connections, including the one-to-many
connections that cannot be realized by the destination tag method.
However, the main drawback is that the network control has to be
set stage by stage from the last stage to the first stage before
the data elements can be passed through the network. Hence an extra
$O(\log N)$ gate delays will be needed for a connection. Details of
these two control methods are described in [4].

A modified destination tag method that will allow certain
broadcasting (one-to-many) connections will be presented here.
Using this modified method, we will do away with the extra
$O(\log N)$ gate delays needed for the source tagging method.
However, the broadcasting functions will be limited to only
one-to-power-of-two elements. For example, the following connection
will not be realizable by this method, but can be realized using
the source tag method.

Example

    Source    Destination
      0           0
      0           1
      0           2
      2           3

For this modified tag method, instead of allowing only 0 or 1 for
each of the $\log_2 N$ tag bits, we allow '0', '1', '*' or '-' for
each of the $\log_2 N$ destination tag characters. So a
source/destination pair may now look like:

    (0 1 1 0 1, 0 * 1 1 *)

This pair will represent a one-to-four broadcasting function of:

    (0 1 1 0 1, 0 0 1 1 0)
    (0 1 1 0 1, 0 0 1 1 1)
    (0 1 1 0 1, 0 1 1 1 0)
    (0 1 1 0 1, 0 1 1 1 1)

The tag character '*' takes on the value of all possible binary
digits, while '0' and '1' still have the original meaning. For
completeness, we have to use the tag character '-' to specify no
connection. So for a complete source/destination set, we might
have:


Example

    (0 0, * 0)
    (0 1, 1 1)
    (1 0, 0 1)
    (1 1, - -)

    source port    destination port
        0               0
        0               2
        1               3
        2               1


It should be clear by now that this modified tag method can only be
used if the power of two (say, $2^h$) destination ports that a
source port is connected to have the same ($\log_2 N - h$) bits in
their tag bit representations. This modified destination tag method
also provides a great notational advantage over the source tag
method when we have to describe one-to-power-of-two broadcasting
functions. As will be seen in later chapters, most of the common
broadcasting functions are of this type and can be easily described
by this modified tag method.
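
The meaning of the four tag characters can be made concrete with a short sketch. This is an illustration only; `expand_tag` and `tag_for` are hypothetical helper names, and tags are written as strings with the most significant character first.

```python
from itertools import product


def expand_tag(tag):
    """List the destination ports denoted by a modified destination tag."""
    if "-" in tag:
        return []  # '-' means no connection
    choices = [("0", "1") if c == "*" else (c,) for c in tag]
    return [int("".join(bits), 2) for bits in product(*choices)]


def tag_for(dests, n):
    """Return a modified tag for a set of n-bit destinations, or None.

    A tag exists only if the 2**h destinations agree on n - h bit
    positions and take all values on the remaining h positions.
    """
    dests = sorted(set(dests))
    if not dests:
        return "-" * n
    tag = ""
    for b in range(n - 1, -1, -1):
        bits = {(d >> b) & 1 for d in dests}
        tag += "*" if len(bits) == 2 else str(bits.pop())
    return tag if expand_tag(tag) == dests else None


print(expand_tag("0*11*"))    # [6, 7, 14, 15]
print(tag_for([0, 2], 2))     # *0
print(tag_for([0, 1, 2], 2))  # None
```

The last call corresponds to the first example of this section, where source 0 is connected to destinations 0, 1 and 2: no modified tag describes that connection, so it has to be left to the source tag method.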


2.2.1.2  Column Control Method

To control an omega network of size N x N for any omega-passable
permutation, we would require $N \log_2 N / 2$ control bits. Each
switching element will either be set to the 'cross over' or
'straight through' state according to the value of the
corresponding control bit. Suppose we are willing to sacrifice some
of the capabilities of the network in order to further simplify the
control structure. If we use only $\log_2 N$ control bits, each
controlling a complete stage of switching elements, then we will
have the column control method. An omega network utilizing the
column control method turns out to be exactly the same as Batcher's
scrambling/unscrambling network[12]. As pointed out in [12], the
scrambling/unscrambling network can be constructed with $\log_2 N$
levels of selections and perfect shuffles, just as an omega
network.


Let $s_{ij}$ be the jth most significant bit of the bit
representation of the ith source port and $d_{ij}$ be the jth most
significant bit of the bit representation of the ith destination
port. Also let $p_j$ be the jth most significant control bit. Then
for any permutation to be realizable by the column control method,

$$d_{ij} = s_{ij} \oplus p_j \qquad \forall j = 1, \ldots, \log_2 N, \quad \forall i = 1, \ldots, N,$$

where $\oplus$ is the exclusive-or operation. More details can be
found in [12].


Since there is a total of only $2^{\log_2 N}$ (= N) different sets
of $p_j$'s, we can pass at most N distinct permutations using this
method. Using the arguments in [12], we can see that if the (i,j)th
element of an NxN matrix is stored in the ith position of memory
module $i \oplus j$ ($0 \le i,j < N$), then we can fetch any row or
column of the matrix without conflict and the data can be aligned
using this column control method. Hence if column and row
accessings are all that are required, we can simply use the very
easily controlled column control method for our omega network.
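
The column control relation and the skewed storage scheme can be checked together. The sketch below assumes that each stage is a perfect shuffle followed by an exchange of every switch when the control bit of that stage is 1; the names `column_control` and `module_of` are illustrative and do not come from [12].

```python
def column_control(n, p):
    """Destination of every source after n column-controlled stages.

    p is the n-bit control word, most significant bit used first.
    """
    N = 1 << n
    dest = []
    for src in range(N):
        pos = src
        for j in range(n):
            pos = ((pos << 1) | (pos >> (n - 1))) & (N - 1)  # perfect shuffle
            pos ^= (p >> (n - 1 - j)) & 1                    # exchange if bit is 1
        dest.append(pos)
    return dest


def module_of(i, j):
    """Memory module holding element (i, j) under the skewed storage."""
    return i ^ j


n, N = 3, 8
# Row 2 of the matrix: element (2, j) sits in module 2 XOR j.
row = 2
modules = [module_of(row, j) for j in range(N)]
# Setting p = row sends the element in module (row XOR j) to processor j.
aligned = column_control(n, row)
print(sorted(modules) == list(range(N)), [aligned[m] for m in modules])
```

Every row (and, by symmetry, every column) touches each memory module exactly once, and a single control word, equal to the row or column index, aligns the fetched elements.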


It is logical to think that by increasing the number of
shuffle-exchange stages, we would be able to obtain more
permutations that can utilize this column control method.
Unfortunately, it can be shown that by increasing the number of
shuffle-exchange stages, we can only derive the original set of
permutations plus the 1-shuffled, 2-shuffled, ...,
($\log_2 N - 1$)-shuffled versions of the permutations. Hence the
maximum number of this set does not exceed $N \log_2 N$. The proof
will not be elaborated here, but Figure 2.1 should illustrate this
phenomenon. Suppose that we want to have a five-stage network using
this column control method (either by building five stages of
shuffle-exchanges or recycling a one stage network five times), and
also that we set p to 01111. The permutation that we get at the
output will be (5 7 1 3 4 6 0 2) for an 8x8 network. However, the
fourth most significant bit of p will force a shuffle and then an
'exclusive or' with the most significant bit of the source tags
again. A similar effect will be produced by the least significant
bit of p on the second most significant bit of the source tags. The
net effect will then be equivalent to setting
$p = 011 \oplus 110$ (= 101) and adding two extra shuffles at the
end. Note that the output of Figure 2.1.2 is also
(5 7 1 3 4 6 0 2). So by setting p=01111, we only obtain the
2-shuffled version of setting p=101.

Figure 2.1  (Figure 2.1.2: intermediate patterns using
$p = 011 \oplus 110 = 101$ plus two shuffles at the end)

However, it should be noted that the $N \log_2 N$ upper limit on
the number of passable permutations only applies to the column
control method. For any individual switch control method (such as
the source/destination tag method and the ROM method), the upper
limit depends on the number of stages.
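
The five-stage example can be reproduced with a few lines of simulation. This is a sketch under the same assumption as before (each stage is a perfect shuffle followed by an exchange when its control bit is 1); `run_stages` is a hypothetical name, and the result lists the source found at each output position.

```python
def run_stages(n, control_bits):
    """Output vector of a column-controlled shuffle-exchange network.

    n is the number of address bits (N = 2**n ports); control_bits
    holds one bit per stage, first stage first.
    """
    N = 1 << n
    output = [None] * N
    for src in range(N):
        pos = src
        for bit in control_bits:
            pos = ((pos << 1) | (pos >> (n - 1))) & (N - 1)  # perfect shuffle
            pos ^= bit                                       # exchange if bit is 1
        output[pos] = src
    return output


five_stage = run_stages(3, [0, 1, 1, 1, 1])
three_stage_plus_shuffles = run_stages(3, [1, 0, 1, 0, 0])
print(five_stage)  # [5, 7, 1, 3, 4, 6, 0, 2]
print(five_stage == three_stage_plus_shuffles)  # True
```

A stage whose control bit is 0 is a pure shuffle, so the second call models p=101 followed by two extra shuffles; both calls give the output vector quoted in the text.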


The  perinutation  capabilities  of  the  log  N  stag^^ 
column  controlled  omega  network  are  well  delirod  by  [12],  and 
we  will  not  rePeat  his  results  here.  However,  we  can 
increase  the  capabilities  of  a  column  controlled  network  to 
allow  certain  broadcasting  functions  by  usi rg  two  control 
bits, bo  and  b,  t  in  each  column.  In  Chapter  3,  it  will  be 
shown  what  broadcasting  functions  can  be  realized  by  this 
method.  The  switching  functions  for  various  values  of  b^  and 
b,  are  shown  in  Figure  2.2. 
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Figure   2,2 
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2.2.1.3  PC iM  _Con t ro l.He  t hod 

To implement the source/destination tag method, we either use a
fast method [13] which would require $5N \log_2 N\,(d + \ldots)$
gates, or a slower method [14] which needs $(4d+11)\,N \log_2 N$
gates but requires the use of strobes at each stage to pass the tag
bits along. The column control method also needs the use of strobes
if a one-stage shuffle-exchange network is used. So the propagation
delay through the network is on the order of $\log_2 N$ clocks. In
this section, we will propose another control method which can
eliminate the use of the strobes (clockings) without paying too
much penalty in gate counts. This ROM control method provides a
faster way to evaluate the control function at each of the
$N \log_2 N / 2$ switches simultaneously, and these functions are
imposed on all stages of the network at the same time. So instead
of taking $\log_2 N$ clocks through the network, we would only
require a couple of clocks for the source data to be routed
through. This greatly reduces the network delay for the processing
system. This method does not pass as many permutations as the
source/destination tag method, but it will pass many of the more
common ones. It can pass all shift, flip, (c-i), and odd-ordered
vector unscrambling permutations in any power of two partitions.


We again assume, as in Section 2.2.1.2, that a control bit value of
1 will set a switching element to the 'cross over' state and the
value of 0 will set it to the 'straight through' state. The basic
idea is to fetch N/2 control bits from each of $\log_2 N$ ROM's,
according to which permutation function is called for. The array of
$\log_2 N \times N/2$ control bits is called the control matrix and
can be imposed on the omega network to facilitate the corresponding
permutation function.

For example, the control matrix for a 1-shift permutation in a 4x4
omega network is:

      0 1
      1 1

Imposing it on an omega network (figure: the 4x4 omega network with
outputs 0 to 3 and this control matrix applied), we will
immediately get the following permutation:

      0 --> 1
      1 --> 2
      2 --> 3
      3 --> 0

which is a 1-shift permutation.


In order to minimize the amount of ROM space required for different
families of permutations, as many common characteristics are
recognized from the control patterns as possible. Then, by using
some extra logical operations, we can form the same control pattern
with much less ROM space required. We first observe that the flip
permutation actually belongs to the family of (c-i) permutations,
with c set to N-1. Then we observe that the control matrix for a
(c-i) permutation is the bit by bit complement of the control
matrix for an (N-1-c) shift permutation. Note that (N-1-c) is the
bit complement of the bit representation of c.

Example:

For N = 8, the control matrices are:

    For 5-shift:        For (2-i) perm.:

      1 0 1               0 1 0
      1 0 1               0 1 0
      1 1 1               0 0 0
      0 1 1               1 0 0

With this knowledge, we can already reduce the three classes of
permutations (shift, (c-i), and flip) into one set of control
patterns.
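
The control matrices of this example can be derived mechanically. The sketch below is one possible way to do so, not the procedure used to fill the ROM's: it routes every source with destination tags and records, for each switch, whether the element passes straight through (0) or crosses over (1). Rows are switches from top to bottom and columns are stages from left to right; `control_matrix` is a hypothetical name.

```python
def control_matrix(dest):
    """Control matrix (N/2 rows, log2 N columns) realizing s -> dest[s]."""
    N = len(dest)
    n = N.bit_length() - 1
    matrix = [[None] * n for _ in range(N // 2)]
    pos = list(range(N))
    for k in range(n):
        for s in range(N):
            shuffled = ((pos[s] << 1) | (pos[s] >> (n - 1))) & (N - 1)
            out_bit = (dest[s] >> (n - 1 - k)) & 1
            state = (shuffled & 1) ^ out_bit  # 1 = cross over
            switch = shuffled >> 1
            if matrix[switch][k] not in (None, state):
                raise ValueError("permutation is not omega-passable")
            matrix[switch][k] = state
            pos[s] = (shuffled & ~1) | out_bit
    return matrix


N = 8
five_shift = control_matrix([(s + 5) % N for s in range(N)])
two_minus_i = control_matrix([(2 - s) % N for s in range(N)])
complement = [[1 - bit for bit in row] for row in five_shift]
print(five_shift)
print(two_minus_i == complement)
```

With c = 2 and N = 8 we have N-1-c = 5, and the computed matrix for the (2-i) permutation is the bit by bit complement of the matrix for the 5-shift, as stated above.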

We further observe that for any permutation to be done in a smaller
partition (say, $2^h$), we only need to use the same control matrix
as for the same permutation in the full network, with the exception
that the leftmost ($\log_2 N - h$) columns will be set to all 0's.
This fact enables us to greatly reduce the number of control
matrices for all the partitioned permutations by simply using a
small number of AND gates controlled by the partition size.

Figure 2.3 shows the control matrices for both the 3-shift
permutation in the full 8x8 network and the 3-shift permutation in
a 4-partition of an 8x8 network.

      0 1 1               0 1 1
      1 1 1               0 1 1
      1 0 1               0 0 1
      1 0 1               0 0 1

     3-shift        3-shift in 4-partition

Figure 2.3  Control Matrices for 3-shift in different partitions

Note that the only difference between the two matrices is that the
second one has all 0's in the first column.
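
The masking of the leftmost columns can be checked against a direct computation. In the sketch below, a partition of size $2^h$ is assumed to be a block of $2^h$ consecutive ports, and the partitioned 3-shift moves each element by 3 positions end-around inside its own block. The helper names are hypothetical; `control_matrix` derives the switch settings by destination tag routing.

```python
def control_matrix(dest):
    """Control matrix (rows = switches, columns = stages) for s -> dest[s]."""
    N = len(dest)
    n = N.bit_length() - 1
    matrix = [[None] * n for _ in range(N // 2)]
    pos = list(range(N))
    for k in range(n):
        for s in range(N):
            shuffled = ((pos[s] << 1) | (pos[s] >> (n - 1))) & (N - 1)
            out_bit = (dest[s] >> (n - 1 - k)) & 1
            matrix[shuffled >> 1][k] = (shuffled & 1) ^ out_bit
            pos[s] = (shuffled & ~1) | out_bit
    return matrix


def mask_for_partition(matrix, n, h):
    """AND the leftmost n - h columns with 0, keep the other columns."""
    return [[0 if k < n - h else bit for k, bit in enumerate(row)]
            for row in matrix]


n, N, h = 3, 8, 2
full = control_matrix([(s + 3) % N for s in range(N)])
size = 1 << h
in_partition = control_matrix(
    [(s & ~(size - 1)) | ((s + 3) & (size - 1)) for s in range(N)])
print(mask_for_partition(full, n, h) == in_partition)
```

For this case the masked full-network matrix and the directly computed partition matrix agree with the two matrices of Figure 2.3.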


Even with these control techniques, the amount of ROM space
required for all the shift patterns is still high. There are N
different shifts and each requires $N \log_2 N / 2$ bits. So a
total of $N^2 \log_2 N / 2$ bits will be required. However, we can
extract more information from the shift patterns to reduce the
total ROM space down to the order of $N^2/2$ bits, which greatly
increases the feasibility of this control method. The control
patterns for various shift permutations for N=8 are shown in
Figure 2.4. Let $n = \log_2 N$ and $s_1 s_2 \ldots s_n$ be the bit
representation of the shift distance. In general, there are N
different patterns for the leftmost column. However, only N/2 of
them are the basic patterns, and shift distances with the same
$s_2 s_3 \ldots s_n$ will have the same basic patterns. Hence
$s_2 s_3 \ldots s_n$ can be used as the address to access the
corresponding pattern stored in the ROM. The ROM for this first
column will have N/2 entries of N/2 bits. The exact pattern will be
the 'exclusive or' of $s_1$ independently with each bit of the
column of control bits. For the second column of the control
matrices, there are N/4 different basic patterns and shift
distances with the same $s_3 s_4 \ldots s_n$ will have the same
basic patterns. The exact pattern will be the 'exclusive or' of
$s_2$ with the column of bits. The number of basic patterns
decreases from left to right. The last column will have only one
basic pattern and will be exclusive-ored with $s_n$ to form the
correct patterns. The basic patterns required for the shift, (c-i)
and flip permutations in any power of two partition can then be
realized using a total ROM space of
$(N^2/4 + N^2/8 + \cdots + N/2)$ bits $= N(N-1)/2$ bits, with the
help of some additional logic elements.

    shift distance and control pattern

       0          1          2          3
     0 0 0      0 0 1      0 1 0      0 1 1
     0 0 0      0 0 1      0 1 0      1 1 1
     0 0 0      0 1 1      1 1 0      1 0 1
     0 0 0      1 1 1      1 1 0      1 0 1

       4          5          6          7
     1 0 0      1 0 1      1 1 0      1 1 1
     1 0 0      1 0 1      1 1 0      0 1 1
     1 0 0      1 1 1      0 1 0      0 0 1
     1 0 0      0 1 1      0 1 0      0 0 1

Figure 2.4  Control Patterns for Shift Permutations, N = 8
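
The ROM organization for the shift patterns can be sketched as follows. This is an illustrative model only: the basic patterns are generated with the rule that entry k of the jth ROM has its $k \cdot 2^{j-1}$ bottom control bits set, the ROM for column j is addressed by $s_{j+1} \ldots s_n$, and its output is exclusive-ored with $s_j$. The function names are hypothetical.

```python
def build_shift_roms(n):
    """Basic shift patterns: roms[j-1][k] is entry k of the jth ROM."""
    half = 1 << (n - 1)  # N/2 control bits per column
    roms = []
    for j in range(1, n + 1):
        entries = []
        for k in range(1 << (n - j)):
            ones = k * (1 << (j - 1))
            entries.append([0] * (half - ones) + [1] * ones)
        roms.append(entries)
    return roms


def shift_control_matrix(roms, n, distance):
    """Rebuild the control matrix of a shift from the basic patterns."""
    columns = []
    for j in range(1, n + 1):
        address = distance % (1 << (n - j))   # bits s_{j+1} ... s_n
        s_j = (distance >> (n - j)) & 1
        columns.append([bit ^ s_j for bit in roms[j - 1][address]])
    return [list(row) for row in zip(*columns)]


def rom_bits(roms):
    """Total number of bits stored in all ROM's."""
    return sum(len(entry) for rom in roms for entry in rom)


n = 3
roms = build_shift_roms(n)
print(shift_control_matrix(roms, n, 5))
print(rom_bits(roms))  # 28, which is N(N-1)/2 for N = 8
```

For N = 8 the rebuilt matrices agree with the patterns of Figure 2.4, and the three ROM's hold 16 + 8 + 4 = 28 bits in total.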


A similar phenomenon to that in the shift patterns can be seen for
odd-ordered vector unscrambling permutations. The control patterns
for various odd-ordered vector unscrambling for N=16 are shown in
Figure 2.5. Let $p_1 p_2 \ldots p_n$ be the bit representation of
the order of unscrambling. Then $p_1 p_2 \ldots p_{n-1}$ will be
used as an address to fetch the basic pattern for the first column,
$p_2 \ldots p_{n-1}$ as the address to fetch the pattern for the
second column, and so on. The output, however, does not need to be
exclusive-ored with $p_j$ (as in the case of shift patterns) to
produce the correct patterns.

A possible organization of the control system is shown in
Figure 2.6. Using microprogramming, we can set $k_1 k_2 \ldots k_n$
to $s_1 s_2 \ldots s_n$ for shifts, to $c_1 c_2 \ldots c_n$ for
(c-i) permutations, and to $0\, p_1 p_2 \ldots p_{n-1}$ for
odd-ordered vector unscrambling.


Figure 2.5  Control Patterns for p-unscrambling

The basic control patterns have to be generated and input into the
ROM's. The basic shift patterns can be generated quite easily.
There will be $N/2^j$ entries in the jth ROM from the left
($1 \le j \le \log_2 N$). The 0th entry in each of the ROM's will
have all 0's. For the kth entry ($1 \le k < N/2^j$) in the jth ROM,
the least significant ($k \cdot 2^{j-1}$) control bits will be 1's,
while the rest are all 0's. To generate the p-ordered vector
unscrambling patterns, it requires a bit more work. For the first
stage pattern, the ith control bit ($0 \le i < N/2$) for
p-unscrambling (i.e. the (p-1)/2 th entry in the ROM) is the
($\log_2 N$)th least significant bit of the product ($p \cdot i$).
For the jth stage pattern ($j \ne 1$), the ith control bit
($0 \le i < N/2$) in the kth entry ($0 \le k < N/2^j$) will be the
same as the ($\lfloor i/2^{j-1} \rfloor \cdot 2^{j-1}$)th control
bit in the kth entry of the first stage ROM.
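
The generation rule for the p-ordered unscrambling patterns can be compared with a direct routing of the permutation $i \rightarrow p \cdot i \bmod N$. The sketch below is an illustration; it assumes that control bit i belongs to the ith switch from the top, that entry k of the first stage ROM holds the pattern for order 2k+1, and that the switch settings can be derived by destination tag routing. All function names are hypothetical.

```python
def first_stage_entry(k, n):
    """Entry k of the first stage ROM: pattern for order p = 2k + 1."""
    p = 2 * k + 1
    return [((p * i) >> (n - 1)) & 1 for i in range(1 << (n - 1))]


def unscramble_pattern(p, n):
    """Control matrix for p-ordered unscrambling built from the ROM rule."""
    half = 1 << (n - 1)
    columns = []
    for j in range(1, n + 1):
        k = (p >> 1) & ((1 << (n - j)) - 1)  # address bits p_j ... p_{n-1}
        entry = first_stage_entry(k, n)
        step = 1 << (j - 1)
        columns.append([entry[(i // step) * step] for i in range(half)])
    return [list(row) for row in zip(*columns)]


def control_matrix(dest):
    """Reference: switch settings found by destination tag routing."""
    N = len(dest)
    n = N.bit_length() - 1
    matrix = [[None] * n for _ in range(N // 2)]
    pos = list(range(N))
    for k in range(n):
        for s in range(N):
            shuffled = ((pos[s] << 1) | (pos[s] >> (n - 1))) & (N - 1)
            out_bit = (dest[s] >> (n - 1 - k)) & 1
            matrix[shuffled >> 1][k] = (shuffled & 1) ^ out_bit
            pos[s] = (shuffled & ~1) | out_bit
    return matrix


n, N = 4, 16
agree = all(
    unscramble_pattern(p, n) == control_matrix([(p * s) % N for s in range(N)])
    for p in range(1, N, 2))
print(agree)
```

Only the first stage ROM has to be computed from products; the later columns reuse its entries at strides of $2^{j-1}$, which is what keeps the ROM space small.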

This  section  indicates  how  the  use  of  ROM's  can 
eliminate  the  need  to  clock  the  omega  network,  at  the  expense 
of  some  relatively  inexpensive  hardware.  The  set  of 
allowable  permutations  is  quite  rich.  Hence  the  ROM  method 
should  be  considered  as  a  major  alternative  to  the 
source/destination  tag  method. 

It should be pointed out here that an alternative to the ROM may be
the use of an initial set of control bits that are routed through
the network, but with some logical operations at each stage. This
method has been discussed by Lang and Stone[15], and will not be
repeated here.

2.2.2  Omega Partition Theorems

One important property of the omega network is its ability to be
partitioned. The theorems in this section will show that a large
size omega network can be regarded as a conglomeration of many
smaller size omega networks, each passing a different smaller
omega-passable connection function. These partition theorems help
to establish many capabilities of a larger size network on smaller
partition connections.

For example, consider an 8x8 omega network. Assume that source
ports 0-3 want to do an end-around 1-shift. Assume also that
destination ports 4-7 request data from source port 5. So the
complete set of source-destination pairs is P =
{(0,1), (1,2), (2,3), (3,0), (5,4), (5,5), (5,6), (5,7)}. We know
that a 4x4 omega network can perform an end-around 1-shift, as well
as a one-to-many broadcasting function. By using the partition
theorem stated below, we can be sure that an 8x8 omega network can
pass P.


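Before we state the partition theorems, we would like to list,
without proofs, some number theory results.

R0)  $a \equiv b,\ c \equiv d \pmod{m} \;\rightarrow\; ax + cy \equiv bx + dy \pmod{m}$

R1)  $x + a \equiv y + a \pmod{m} \;\rightarrow\; x \equiv y \pmod{m}$

R2)  $x \equiv y \pmod{m} \;\rightarrow\; ax \equiv ay \pmod{m}$

R3)  If a is prime to m (i.e. gcd(a,m) = 1), then
     $ax \equiv ay \pmod{m} \;\rightarrow\; x \equiv y \pmod{m}$

R4)  $ax \equiv ay \pmod{am} \;\leftrightarrow\; x \equiv y \pmod{m}$

R5)  If $0 \le x, y < m$, then
     $x \equiv y \pmod{am} \;\leftrightarrow\; x \equiv y \pmod{m}$

R6)  $x \equiv y$ and $x \ne y \;\rightarrow\; x \not\equiv y$, where the
     moduli of the two congruences are not legible in the source.

R1 through R6 can be found in [4].

We will also present Lemma 2.1, which is extended from Lemma 1 of
[4] and some of the above number theory results.

Lemma 2.1

Let $0 \le x_1, y_1 \le n-1$ and $0 \le x_2, y_2 \le a-1$. Then

$$a x_1 + x_2 \equiv a y_1 + y_2 \pmod{an} \;\leftrightarrow\; x_1 \equiv y_1 \pmod{n} \text{ and } x_2 \equiv y_2 \pmod{a}.$$

Proof:

a) To prove $a x_1 + x_2 \equiv a y_1 + y_2 \pmod{an} \rightarrow x_1 \equiv y_1 \pmod{n}$
and $x_2 \equiv y_2 \pmod{a}$: proved in Lemma 1 of [4].

b) To prove $x_1 \equiv y_1 \pmod{n}$ and $x_2 \equiv y_2 \pmod{a} \rightarrow
a x_1 + x_2 \equiv a y_1 + y_2 \pmod{an}$: assume $x_1 \equiv y_1 \pmod{n}$
and $x_2 \equiv y_2 \pmod{a}$. By R4, $a x_1 \equiv a y_1 \pmod{an}$. Since
$0 \le x_2, y_2 < a$, then by R5, we have $x_2 \equiv y_2 \pmod{an}$. So
by R0, we get $a x_1 + x_2 \equiv a y_1 + y_2 \pmod{an}$. QED.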

To simplify the proof of the partition theorems, we need to restate Theorem 2 in [4], which dictates whether a given connection is omega-passable or not.

Lemma 2.2 (Equivalent Statement of Theorem 2 in [4])

Given a set of desired input-output connections $P_N = \{(S_i, D_i) \mid 0 \le i < N\}$, an NxN omega network passes $P_N$ if and only if, for all S-D pairs in $P_N$ and for all $m = 2^k$, where $1 \le k \le \log_2 N$,

$S_i = S_j$ or $S_i \not\equiv S_j \pmod{m}$ or $\lfloor D_i/m \rfloor \ne \lfloor D_j/m \rfloor$.

In words, two distinct sources whose trailing k bits agree must not be sent to destinations whose leading $\log_2 N - k$ bits agree.

Let $P_L = \{(s_i, d_i) \mid 0 \le i < L\}$ and $P_{M_i} = \{(t_{ij}, e_{ij}) \mid 0 \le j < M\}$, $0 \le i < L$. We define

$P_N = P_L \times \{P_{M_0}, P_{M_1}, \ldots, P_{M_{L-1}}\} = \{(s_i M + t_{ij},\ d_i M + e_{ij}) \mid 0 \le i < L,\ 0 \le j < M\}$.

For example, let L=4, M=4 and N=16. If

$P_{M_0}$ = {(0,1), (1,2), (2,3), (3,0)}, a 1-shift permutation,
$P_{M_1}$ = {(0,0), (0,1), (0,2), (0,3)}, a 1-to-4 broadcast connection,
$P_{M_2}$ = {(0,3), (1,2), (2,1), (3,0)}, a flip permutation,
$P_{M_3}$ = {(0,0), (1,3), (2,2), (3,1)}, a 3-order unscrambling, and
$P_L$ = {(0,2), (1,3), (2,0), (3,1)}, a 2-shift permutation,

then by definition, $P_N$ = {(0,9), (1,10), (2,11), (3,8), (4,12), (4,13), (4,14), (4,15), (8,3), (9,2), (10,1), (11,0), (12,4), (13,7), (14,6), (15,5)}.
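The construction of the partitioned connection and the passability test of Lemma 2.2 can be sketched in Python. This is only an illustrative sketch: the names `partitioned` and `omega_passes` are not from the thesis, and as an added assumption the test also includes m = 1, so that two distinct sources aimed at the same destination are rejected.

```python
from itertools import combinations

def partitioned(p_l, p_ms, m):
    """Build P_N = P_L x {P_M0, ..., P_M(L-1)} as a list of (source, destination)."""
    return [(s * m + t, d * m + e)
            for (s, d), p_m in zip(p_l, p_ms)
            for (t, e) in p_m]

def omega_passes(pairs, n):
    """Window test of Lemma 2.2 for an n x n omega network (n a power of 2)."""
    stages = n.bit_length() - 1
    for (s1, d1), (s2, d2) in combinations(pairs, 2):
        if s1 == s2:
            continue  # same source: a broadcast shares its path legitimately
        for k in range(stages):
            m = 1 << k
            # conflict: trailing bits of the sources and leading bits of the
            # destinations agree for two different sources
            if s1 % m == s2 % m and d1 // m == d2 // m:
                return False
    return True

p_l = [(0, 2), (1, 3), (2, 0), (3, 1)]      # 2-shift between partitions
p_ms = [
    [(0, 1), (1, 2), (2, 3), (3, 0)],       # 1-shift
    [(0, 0), (0, 1), (0, 2), (0, 3)],       # 1-to-4 broadcast
    [(0, 3), (1, 2), (2, 1), (3, 0)],       # flip
    [(0, 0), (1, 3), (2, 2), (3, 1)],       # 3-order unscrambling
]
p_n = partitioned(p_l, p_ms, 4)
print(p_n)
print(omega_passes(p_n, 16))
```

Under these assumptions the sketch reproduces the sixteen pairs of the example and reports the resulting connection as passable, while a 4-element perfect shuffle {(0,0), (1,2), (2,1), (3,3)} is reported as not passable.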

In words, the sources and destinations of $P_N$ are divided into 4 partitions. $P_L$ is the inter-partition permutation function, and the $P_{M_i}$'s are the individual partition permutations. $P_L$ moves partition #0 to partition #2 and then the individual elements in partition #2 will be moved according to $P_{M_0}$, and so on. A pictorial illustration of $P_N$ is shown in Figure 2.7.

Figure 2.7 A Partitioned Permutation

With all these preliminary definitions and lemmas, we can present the first of the omega partition theorems.

Theorem 2.1

Let $\Omega_L$ pass $P_L$ and $\Omega_M$ pass $P_{M_i}$, $0 \le i < L$, and let $N = L \times M$; then $\Omega_N$ passes $P_N = P_L \times \{P_{M_0}, P_{M_1}, \ldots, P_{M_{L-1}}\}$.

We first present a simple sketch proof in Proof 1, then we present a more rigorous proof in Proof 2.

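Proof 1

Assume $\Omega_N$ does not pass $P_N$. By Lemma 2.2, there exist $S_1 = s_p M + t_{pq}$, $D_1 = d_p M + e_{pq}$, $S_2 = s_u M + t_{uv}$, $D_2 = d_u M + e_{uv}$ and $X$ such that $S_1 \ne S_2$, $S_1 \equiv S_2 \pmod{X}$ and $\lfloor D_1/X \rfloor = \lfloor D_2/X \rfloor$.

Let m = log M, n = log N, b = log L, and x = log X.

If $X \ge M$, pictorially we have:

[Diagram: $S_1$ and $S_2$ are each split into a b-bit field ($s_p$, $s_u$) followed by an m-bit field ($t_{pq}$, $t_{uv}$), with the trailing x bits marked; $D_1$ and $D_2$ are split likewise into ($d_p$, $d_u$) and ($e_{pq}$, $e_{uv}$), with the leading b+m-x bits marked.]

Here the trailing x bits of $S_1$ and $S_2$ are equal, but the leading (b+m-x) bits are not equal, and the leading (b+m-x) bits of $D_1$ and $D_2$ are equal. Since $x \ge m$, the trailing (a = x-m) bits of $s_p$ and $s_u$ are equal but the leading (b-a) bits are different, and the leading (b-a) bits of $d_p$ and $d_u$ are equal. This contradicts the assumption that $\Omega_L$ passes $P_L$.

If $X < M$, pictorially we have:

[Diagram: the same field layout as above, with the trailing x bits of $S_1$, $S_2$ lying inside the m-bit fields $t_{pq}$, $t_{uv}$, and the leading b+m-x bits of $D_1$, $D_2$ covering $d_p$, $d_u$ and the leading bits of $e_{pq}$, $e_{uv}$.]

Since the leading (b+m-x) bits of $D_1$ and $D_2$ are equal and m > x, we have $d_p = d_u$ and $\lfloor e_{pq}/X \rfloor = \lfloor e_{uv}/X \rfloor$. Since $\Omega_L$ passes $P_L$ and $d_p = d_u$, p has to be equal to u. So $s_p = s_u$. This implies that $t_{pq} \ne t_{uv}$, since $S_1 \ne S_2$. $S_1 \equiv S_2 \pmod{X}$ and $X < M$ imply that $t_{pq} \equiv t_{uv} \pmod{X}$. Setting p=u, we get $\lfloor e_{pq}/X \rfloor = \lfloor e_{pv}/X \rfloor$, $t_{pq} \ne t_{pv}$ and $t_{pq} \equiv t_{pv} \pmod{X}$. They imply that $\Omega_M$ does not pass $P_{M_p}$, which is a contradiction. Hence, Theorem 2.1 is proved.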

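Proof 2

Assume $\Omega_N$ does not pass $P_N$; then by Lemma 2.2, there exist $(s_p M + t_{pq},\ d_p M + e_{pq})$ and $(s_u M + t_{uv},\ d_u M + e_{uv})$ and $X = 2^x$ such that:

(a) $s_p M + t_{pq} \ne s_u M + t_{uv}$,
(b) $s_p M + t_{pq} \equiv s_u M + t_{uv} \pmod{X}$, and
(c) $\lfloor (d_p M + e_{pq})/X \rfloor = \lfloor (d_u M + e_{uv})/X \rfloor$.

By Lemma 2.1 and (a), $s_p \ne s_u$ or $t_{pq} \ne t_{uv}$. (2.1)

If $X \ge M$, let $A = X/M = 2^a$.

By Lemma 2.1 and (b), $s_p \equiv s_u \pmod{A}$ and $t_{pq} = t_{uv}$. (2.2)

From (2.1) and (2.2) we get $s_p \ne s_u$ (2.3)
and $s_p \equiv s_u \pmod{A}$. (2.4)

By definition, (c) implies
$X \lfloor (d_p M + e_{pq})/X \rfloor \equiv X \lfloor (d_u M + e_{uv})/X \rfloor \pmod{N}$
$\Rightarrow X \lfloor d_p M / X \rfloor \equiv X \lfloor d_u M / X \rfloor \pmod{N}$, since $e_{pq}, e_{uv} < M \le X$
$\Rightarrow MA \lfloor d_p/A \rfloor \equiv MA \lfloor d_u/A \rfloor \pmod{N}$
$\Rightarrow A \lfloor d_p/A \rfloor \equiv A \lfloor d_u/A \rfloor \pmod{L}$, by R4
$\Rightarrow \lfloor d_p/A \rfloor = \lfloor d_u/A \rfloor$. (2.5)

(2.3), (2.4) and (2.5) imply that $\Omega_L$ does not pass $P_L$, by Lemma 2.2, a contradiction.

If $X < M$, let $B = M/X$.

(b) $\Rightarrow s_p B X + t_{pq} \equiv s_u B X + t_{uv} \pmod{X}$
$\Rightarrow t_{pq} \equiv t_{uv} \pmod{X}$, (2.6)
since $s_p B X \equiv s_u B X \equiv 0 \pmod{X}$.

(c) $\Rightarrow X \lfloor (d_p M + e_{pq})/X \rfloor \equiv X \lfloor (d_u M + e_{uv})/X \rfloor \pmod{N}$
$\Rightarrow X d_p B + X \lfloor e_{pq}/X \rfloor = X d_u B + X \lfloor e_{uv}/X \rfloor$, since M/X is an integer
$\Rightarrow d_p M + X \lfloor e_{pq}/X \rfloor = d_u M + X \lfloor e_{uv}/X \rfloor$
$\Rightarrow d_p = d_u$ (2.7)
and $X \lfloor e_{pq}/X \rfloor = X \lfloor e_{uv}/X \rfloor$, (2.8)
by Lemma 2.1 and since $0 \le d_p, d_u < L$, $0 \le X \lfloor e_{pq}/X \rfloor \le e_{pq} < M$, and $0 \le X \lfloor e_{uv}/X \rfloor \le e_{uv} < M$.

For (2.7) to be true without contradicting the assumption that $\Omega_L$ passes $P_L$,

p = u. (2.9)

Obviously $s_p = s_u$, and therefore (2.1) implies $t_{pq} \ne t_{uv}$. (2.10)

Rewriting (2.10), (2.6) and (2.8) with p=u, we get:

$t_{pq} \ne t_{pv}$, (2.11)
$t_{pq} \equiv t_{pv} \pmod{X}$, (2.12)
and $\lfloor e_{pq}/X \rfloor = \lfloor e_{pv}/X \rfloor$. (2.13)

(2.11), (2.12) and (2.13) imply that $\Omega_M$ does not pass $P_{M_p}$, a contradiction. Hence $\Omega_N$ passes $P_N$. Q.E.D.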


The following corollary is a direct consequence of Theorem 2.1.

Corollary 2.1

An NxN omega network can be made to behave as N/M independent MxM omega networks.

Proof:

Let L = N/M and $P_L = \{(i,i) \mid 0 \le i < L\}$; then $\Omega_L$ passes $P_L$ trivially. Hence if we can pick N/M $P_M$'s ($P_{M_0}, P_{M_1}, \ldots, P_{M_{L-1}}$) that are omega passable, then, by Theorem 2.1, $\Omega_N$ passes $P_N = P_L \times \{P_{M_0}, P_{M_1}, \ldots, P_{M_{L-1}}\}$.

In Theorem 2.1, the tag bits denoting the partitioning are the most significant $\log_2 L$ bits. In the following two theorems, we extend the result to any set of $\log_2 L$ bits in the tag representation. The two theorems will not be proved formally. However, the illustrations could be easily generalized to formal proofs.


Let us look at $(s_i, d_i)$ and $(s_j, d_j)$ as follows:

$(s_i, d_i)$: $(x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_{10},\ y_1 y_2 y_3 y_4 y_5 y_6 y_7 y_8 y_9 y_{10})$
$(s_j, d_j)$: $(a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9 a_{10},\ b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_{10})$

Assume there are $\log_2 L$ underlined bits and $\log_2 M$ non-underlined bits. In this illustration the underlined bits are those in positions 1, 4, 9 and 10 of each tag.

Theorem 2.2

Assume all the underlined bits of (s,d) satisfy an omega passable connection† $P_L$, and all the non-underlined bits of (s,d) satisfy an omega passable connection $P_{M_k}$, where k represents the total numerical value of the underlined bits of s. Then {(s,d)} is passable by $\Omega_{LM}$.

† Note that a connection can be a broadcasting function, while a permutation cannot.

Proof:

Assume {(s,d)} is not passable by $\Omega_{LM}$; then there exist $(s_i, d_i)$, $(s_j, d_j)$, and r (say, = 6, in this case) such that

$x_1 x_2 x_3 x_4 \ne a_1 a_2 a_3 a_4$,
$x_5 x_6 x_7 x_8 x_9 x_{10} = a_5 a_6 a_7 a_8 a_9 a_{10}$,
and $y_1 y_2 y_3 y_4 = b_1 b_2 b_3 b_4$.

Case 1: $x_1 x_4 \ne a_1 a_4$

In this case, we have $x_9 x_{10} = a_9 a_{10}$, $x_1 x_4 \ne a_1 a_4$ and $y_1 y_4 = b_1 b_4$, which implies that $\Omega_L$ does not pass $P_L$, a contradiction.

Case 2: $x_1 x_4 = a_1 a_4$

Here we have $x_1 x_4 x_9 x_{10} = a_1 a_4 a_9 a_{10}$, and $s_i$ and $s_j$ will have the same k value. Now $x_2 x_3 \ne a_2 a_3$, $x_5 x_6 x_7 x_8 = a_5 a_6 a_7 a_8$ and $y_2 y_3 = b_2 b_3$, which contradict the assumption that $\Omega_M$ passes $P_{M_k}$.

Hence Theorem 2.2 is proved by contradiction.


Theorem 2.3

Assume all the underlined bits of (s,d) satisfy an omega passable permutation $P_L$, and all the non-underlined bits of (s,d) satisfy an omega passable connection $P_{M_k}$, where k represents the total numerical value of the underlined bits of d. Then {(s,d)} is passable by $\Omega_{LM}$.

Proof:

Assume {(s,d)} is not passable by $\Omega_{LM}$; then there exist $(s_i, d_i)$, $(s_j, d_j)$, and r (say, = 6, in this case) such that

$x_1 x_2 x_3 x_4 \ne a_1 a_2 a_3 a_4$,
$x_5 x_6 x_7 x_8 x_9 x_{10} = a_5 a_6 a_7 a_8 a_9 a_{10}$,
and $y_1 y_2 y_3 y_4 = b_1 b_2 b_3 b_4$.

Case 1: $x_1 x_4 \ne a_1 a_4$

In this case, we have $x_1 x_4 \ne a_1 a_4$, $x_9 x_{10} = a_9 a_{10}$ and $y_1 y_4 = b_1 b_4$, which imply that $\Omega_L$ does not pass $P_L$, a contradiction.

Case 2: $x_1 x_4 = a_1 a_4$

This implies that $x_2 x_3 \ne a_2 a_3$. Also we will get $x_1 x_4 x_9 x_{10} = a_1 a_4 a_9 a_{10}$. Since $P_L$ is a permutation, $y_1 y_4 y_9 y_{10} = b_1 b_4 b_9 b_{10}$. Hence $d_i$ and $d_j$ will have the same k value. Now $x_2 x_3 \ne a_2 a_3$, $x_5 x_6 x_7 x_8 = a_5 a_6 a_7 a_8$, and $y_2 y_3 = b_2 b_3$, which contradicts the assumption that $\Omega_M$ passes $P_{M_k}$. Hence Theorem 2.3 is proved.

It should be noted that, while basically all three theorems allow the $P_M$'s to be any connection function, $P_L$ in Theorem 2.3 is the only one that is required to be a permutation. To help appreciate why $P_L$ has to be a permutation in Theorem 2.3, but not in Theorem 2.1, we will show the following example.


Example:

For L=2, M=4, we let $P_L$ be {(0,0), (0,1)}, $P_{M_0}$ be a 1-shift permutation and $P_{M_1}$ be a 2-shift permutation.

1) Let the most significant bit be the underlined bit; we have:

    s1 s2 s3     d1 d2 d3
     0  0  0      0  0  1
     0  0  1      0  1  0
     0  1  0      0  1  1
     0  1  1      0  0  0
     0  0  0      1  1  0
     0  0  1      1  1  1
     0  1  0      1  0  0
     0  1  1      1  0  1

There is no conflict.

2) Let the middle bit be the underlined bit; we have:

    s1 s2 s3     d1 d2 d3
     0  0  0      0  0  1
     0  0  1      1  0  0
     1  0  0      1  0  1
     1  0  1      0  0  0
     0  0  0      1  1  0
     0  0  1      1  1  1
     1  0  0      0  1  0
     1  0  1      0  1  1

There are many conflicts (e.g., between (0,1) and (4,2)).

3) Let the least significant bit be the underlined bit; we have:

    s1 s2 s3     d1 d2 d3
     0  0  0      0  1  0
     0  1  0      1  0  0
     1  0  0      1  1  0
     1  1  0      0  0  0
     0  0  0      1  0  1
     0  1  0      1  1  1
     1  0  0      0  0  1
     1  1  0      0  1  1

There are many conflicts (e.g., between (0,2) and (4,1)).
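The three cases of the example can be checked mechanically with a short Python sketch. The helper names (`omega_passes`, `insert_bit`, `example`) are illustrative and not from the thesis; the passability test is the window condition of Lemma 2.2, with m = 1 added as an assumption so that destination collisions are also caught.

```python
from itertools import combinations

def omega_passes(pairs, n):
    """Window test of Lemma 2.2 for an n x n omega network (n a power of 2)."""
    stages = n.bit_length() - 1
    for (s1, d1), (s2, d2) in combinations(pairs, 2):
        if s1 == s2:
            continue  # one source broadcasting is not a conflict
        for k in range(stages):
            m = 1 << k
            if s1 % m == s2 % m and d1 // m == d2 // m:
                return False
    return True

def insert_bit(rest, bit, pos):
    """Insert `bit` at bit position `pos` (0 = least significant) of `rest`."""
    low = rest & ((1 << pos) - 1)
    high = rest >> pos
    return (high << (pos + 1)) | (bit << pos) | low

P_L = [(0, 0), (0, 1)]                      # partition 0 broadcast to 0 and 1
P_M = [
    [(t, (t + 1) % 4) for t in range(4)],   # 1-shift, used with (0,0)
    [(t, (t + 2) % 4) for t in range(4)],   # 2-shift, used with (0,1)
]

def example(pos):
    """Connection obtained when the partition (underlined) bit sits at `pos`."""
    return [(insert_bit(t, s, pos), insert_bit(e, d, pos))
            for (s, d), p_m in zip(P_L, P_M)
            for (t, e) in p_m]

# pos = 2: most significant bit, pos = 1: middle bit, pos = 0: least significant
results = {pos: omega_passes(example(pos), 8) for pos in (2, 1, 0)}
print(results)
```

With these assumptions only the most-significant-bit placement is reported as conflict free, and the conflicting pairs quoted in the example, (0,1) with (4,2) and (0,2) with (4,1), appear in the middle-bit and least-significant-bit connections respectively.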


This partitioning property of the omega network proves to be vital for the efficient handling of many algorithms, especially the Recurrence Solvers, as shown in Chapter 3 of this thesis.

2.2.3 Omega Broadcast Theorems

Theorems 10 and 11 of [4] describe broadcasting functions for square blocks. In this section, we are going to extend them to 3-dimensional arrays, not necessarily with edges of equal size.
We use the notation (k,x,y)<a,b,c> to denote element (k,x,y) of an a x b x c array. Here $0 \le k < a$, $0 \le x < b$, $0 \le y < c$. Also, (k,x,y)<a,b,c> ---> (*,x,y)<a,b,c> symbolizes the mapping of element (k,x,y) to elements (0,x,y), (1,x,y), ..., (a-1,x,y).

Now we can show six extensions of the broadcast theorems.

For constant k and for all values of x, y, *:

(i)   $\Omega_{abc}$ passes {(k,x,y)<a,b,c> ---> (*,x,y)<a,b,c>}
(ii)  $\Omega_{abc}$ passes {(k,x,y)<a,b,c> ---> (x,*,y)<b,a,c>}
(iii) $\Omega_{abc}$ passes {(k,x,y)<a,b,c> ---> (x,y,*)<b,c,a>}
(iv)  $\Omega_{abc}$ passes {(k,x,y)<a,b,c> ---> (y,x,*)<c,b,a>} iff $a \ge c$
(v)   $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (*,y,x)<a,c,b>}
(vi)  $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (y,*,x)<c,a,b>}


Proof:

Let a' = log(a), b' = log(b), c' = log(c), and r' = log(r).

(i) $\Omega_a$ passes {(k) ---> (*)}, a 1-to-many broadcasting function, and $\Omega_{bc}$ passes {(x,y)<b,c> ---> (x,y)<b,c>}, an identity. Therefore, (i) is proved by applying Theorem 2.1.

(ii) First we will prove that $\Omega_{ab}$ passes {(k,x)<a,b> ---> (x,*)<b,a>}. Assume it is false; this implies that there exist $x_i, x_j, r, p, q$ such that $x_i \ne x_j$, $kb + x_i \equiv kb + x_j \pmod{r}$ and $\lfloor (x_i a + p)/r \rfloor = \lfloor (x_j a + q)/r \rfloor$.

If $r \ge b$, we have $x_i \ne x_j$ and $x_i = x_j$, a contradiction.

If $r < b$, the most significant (a'+b'-r') bits of $x_i$ will be the same as those of $x_j$, while the least significant r' bits of $x_i$ will be the same as those of $x_j$. Since $x_i$ and $x_j$ have only b' bits each, $x_i = x_j$, which is a contradiction.

Hence $\Omega_{ab}$ passes {(k,x) ---> (x,*)}. Using Theorem 2.1 again, noting that $\Omega_c$ passes {(y) ---> (y)}, (ii) is proved.

(iii) The proof of (iii) is similar to that of "$\Omega_{ab}$ passes {(k,x) ---> (x,*)}" in (ii).


(iv) For this case, we can represent the tags $(s_i, d_i)$ and $(s_j, d_j)$ as follows:

[Tag layout: $s_i = (k, x_i, y_i)$ and $s_j = (k, x_j, y_j)$ with field widths a', b', c', the trailing r' bits marked; $d_i = (y_i, x_i, p)$ and $d_j = (y_j, x_j, q)$ with field widths c', b', a', the leading a'+b'+c'-r' bits marked.]

Each tag is divided into 3 parts of lengths a', b' and c' respectively.

If $a \ge c$, assume $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (y,x,*)<c,b,a>}. This implies that there exist $s_i, d_i, s_j, d_j$, and r such that $s_i \ne s_j$, the least significant r' bits of $s_i$ and $s_j$ are equal, and the most significant (a'+b'+c'-r') bits of $d_i$ and $d_j$ are equal.

If $r \ge c$, then from $s_i$ and $s_j$ we can see that $y_i = y_j$. Also, the least significant (r'-c') bits of $x_i$ and $x_j$ are equal. From $d_i$ and $d_j$, we can see that the most significant (a'+b'+c'-r'-c') bits of $x_i$ and $x_j$ are equal. These add up to a'+b'-c' bits of $x_i$ and $x_j$, which is greater than or equal to b' if $a \ge c$. So $x_i = x_j$. This contradicts $s_i \ne s_j$.

If $r < c$, we have the most significant (a'+b'+c'-r') bits of $d_i$ and $d_j$ equal; this count is greater than a'+b' and thus greater than c'. So $y_i = y_j$. From an argument similar to that in the paragraph above, we have $x_i = x_j$. This also contradicts $s_i \ne s_j$.

Hence if $a \ge c$, $\Omega_{abc}$ passes {(k,x,y)<a,b,c> ---> (y,x,*)<c,b,a>}.

If $a < c$, assume $\Omega_{abc}$ passes {(k,x,y)<a,b,c> ---> (y,x,*)<c,b,a>}.

If $ab > c$, we pick $(s_i, d_i)$ and $(s_j, d_j)$ such that $y_i = y_j$ and p = q, and $x_i = x_j$ except for the most significant (c'-a') bits. Then for r' = a'+b', the least significant r' bits of $s_i$ and $s_j$ are equal and the most significant (a'+b'+c'-r') bits of $d_i$ and $d_j$ are equal, and since $s_i \ne s_j$, we have shown that $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (y,x,*)<c,b,a>}.

If $ab \le c$, we simply have to pick $x_i \ne x_j$ and $y_i = y_j$ to arrive at the contradiction. So now we get that $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (y,x,*)<c,b,a>}.

(v) This case can be represented pictorially as:

[Tag layout: $s = (k, x, y)$ with field widths a', b', c'; $d = (p, y, x)$ with field widths a', c', b', the leading a'+b'+c'-r' bits marked.]

If $b > c$, we can pick $(s_i, d_i)$ and $(s_j, d_j)$ such that p = q, $y_i = y_j$, and the most significant (b'-c') bits of $x_i$ and $x_j$ are equal, with the least significant c' bits unique. Then $s_i \ne s_j$. By letting r = c, we can see that the least significant r' bits of $s_i$ and $s_j$ are equal. Also, the most significant [a'+c'+(b'-c')] bits of $d_i$ and $d_j$ are equal, and this count is equal to a'+b'+c'-r'. Hence we have shown that $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (*,y,x)<a,c,b>}.

If $c \ge b$, we pick $(s_i, d_i)$ and $(s_j, d_j)$ such that p = q, $y_i = y_j$ and $x_i \ne x_j$. Then $s_i \ne s_j$. We also pick r = c, so that the least significant r' bits of $s_i$ and $s_j$ are equal. The number of leading bits of $d_i$ and $d_j$ that are the same is a'+r' $\ge$ a'+b' = a'+b'+(c'-c'). Hence we can see that $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (*,y,x)<a,c,b>}.

(vi) We can represent this case as:

[Tag layout: $s = (k, x, y)$ with field widths a', b', c'; $d = (y, p, x)$ with field widths c', a', b', the leading a'+b'+c'-r' bits marked.]

If a'+b'+c' > 2c'+a' (i.e. b' > c'), we pick $y_i = y_j$, p = q and the last (b'-c') bits of $x_i$ and $x_j$ equal. Then there will be a conflict if r' = b'. This implies that $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (y,*,x)<c,a,b>}.

If a'+b'+c' $\le$ 2c'+a' (i.e. b' $\le$ c'), we simply pick $y_i = y_j$, p = q and $x_i \ne x_j$. Then for $r \le c$ we will have a conflict, which implies that $\Omega_{abc}$ does not pass {(k,x,y)<a,b,c> ---> (y,*,x)<c,a,b>}.
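The six broadcast statements can be spot-checked for one small array with a Python sketch. This is a check of a single size (a = b = c = 2, k = 0), not a proof; the helper names are illustrative, and the passability test is the window condition of Lemma 2.2 with m = 1 included as an added assumption.

```python
from itertools import combinations, product

def omega_passes(pairs, n):
    """Window test of Lemma 2.2 for an n x n omega network (n a power of 2)."""
    stages = n.bit_length() - 1
    for (s1, d1), (s2, d2) in combinations(pairs, 2):
        if s1 == s2:
            continue  # the same source may feed several destinations
        for k in range(stages):
            m = 1 << k
            if s1 % m == s2 % m and d1 // m == d2 // m:
                return False
    return True

def index(coords, dims):
    """Row-major index of element `coords` in an array with edge sizes `dims`."""
    i = 0
    for c, d in zip(coords, dims):
        i = i * d + c
    return i

def broadcast_pairs(a, b, c, k, dest):
    """Connections (k,x,y)<a,b,c> ---> dest(x, y, *) for all x, y and *."""
    pairs = []
    for x, y, star in product(range(b), range(c), range(a)):
        coords, dims = dest(x, y, star)
        pairs.append((index((k, x, y), (a, b, c)), index(coords, dims)))
    return pairs

a = b = c = 2
cases = [
    lambda x, y, s: ((s, x, y), (a, b, c)),   # (i)
    lambda x, y, s: ((x, s, y), (b, a, c)),   # (ii)
    lambda x, y, s: ((x, y, s), (b, c, a)),   # (iii)
    lambda x, y, s: ((y, x, s), (c, b, a)),   # (iv), here a >= c
    lambda x, y, s: ((s, y, x), (a, c, b)),   # (v)
    lambda x, y, s: ((y, s, x), (c, a, b)),   # (vi)
]
results = [omega_passes(broadcast_pairs(a, b, c, 0, f), a * b * c) for f in cases]
print(results)
```

For this 2 x 2 x 2 case the sketch reports (i) through (iv) as passable and (v) and (vi) as not passable, in agreement with the six statements.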


These broadcasting theorems are also essential in establishing the recurrence algorithms in Chapter 3 and are some of the more important properties of the omega network.

2.2.4 General Admissibility

It is well known that the omega network can only pass a small fraction of the total number of N! permutations ($N^{N/2}/N!$). To improve the permutation ability and to understand better the relationship between permutations and the omega and inverse omega networks, we would like to extend some of the results of Pease [16].
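The fraction quoted above can be confirmed by brute force for small N. The following Python sketch counts the permutations that satisfy the window condition of Lemma 2.2; the function names are illustrative, and including m = 1 in the test is an added assumption (it is harmless for permutations).

```python
from itertools import combinations, permutations

def omega_passes(pairs, n):
    """Window test of Lemma 2.2 for an n x n omega network (n a power of 2)."""
    stages = n.bit_length() - 1
    for (s1, d1), (s2, d2) in combinations(pairs, 2):
        if s1 == s2:
            continue
        for k in range(stages):
            m = 1 << k
            if s1 % m == s2 % m and d1 // m == d2 // m:
                return False
    return True

def count_passable(n):
    """Number of permutations of n elements passed in a single omega pass."""
    return sum(omega_passes(list(enumerate(p)), n)
               for p in permutations(range(n)))

print(count_passable(4), "of 24")      # compare with 4 ** 2
print(count_passable(8), "of 40320")   # compare with 8 ** 4
```

The counts agree with $N^{N/2}$, the number of distinct settings of the $(N/2)\log_2 N$ two-state switches.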


Without relabelling, the indirect binary n-cube array described in [16] is actually an inverse omega network [4] after rearrangement of the switches. We can extend his results to the omega network in a very simple manner. Let the index p be represented as $(p_1 2^{n-1} + p_2 2^{n-2} + \cdots + p_{n-1} 2 + p_n)$† instead of $(p_1 + 2 p_2 + \cdots + 2^{n-1} p_n)$, as in Formula 1 of [16]. Then we can use exactly the same theorems as in [16], except that now we should note that the index bits are reversed.

Let x and y be expanded in binary notation as $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$, with $x_1$ and $y_1$ being the most significant bits, $x_n$ and $y_n$ the least significant. The function describing a permutation P can be written as a set of functions

$y_i = P_i(x_1, x_2, \ldots, x_n)$. (2.14)

† This notation is consistent with other sections of this thesis.

The principal theorem for the omega network will then be:

Theorem 2.4

P is admissible by an omega network if and only if the functions (2.14) defining P can be written in the form

$y_i = x_i \oplus f_i(y_1, \ldots, y_{i-1}, x_{i+1}, \ldots, x_n)$ (2.15)

for $1 \le i \le n$. $\oplus$ is the 'exclusive or' operation.

It is a reformulation of Theorem 2 in [4]. However, it applies only to permutations. No broadcasting function is considered.

A similar theorem for the inverse omega network will be:

Theorem 2.5

P is admissible by an inverse omega network if and only if the functions (2.14) defining P can be written in the form

$y_i = x_i \oplus f_i(y_n, \ldots, y_{i+1}, x_{i-1}, \ldots, x_1)$ (2.16)

for $1 \le i \le n$. $\oplus$ is the 'exclusive or' operation.


Using the new notation, we will redefine lower and upper triangular permutations as follows:

Definition 2.2

A permutation is lower triangular if the functions (2.15) defining P can be written as

$y_1 = x_1 \oplus c$
$y_i = x_i \oplus f_i(y_1, y_2, \ldots, y_{i-1})$ (2.17)

where $2 \le i \le n$, c = 0 or 1.

Definition 2.3

A permutation is upper triangular if the functions (2.15) defining P can be written as

$y_i = x_i \oplus f_i(x_{i+1}, x_{i+2}, \ldots, x_n)$
$y_n = x_n \oplus c$ (2.18)

where $1 \le i \le n-1$, c = 0 or 1.

All lower and upper triangular permutations are passable by the omega network. We will now present the following two lemmas, which show that they are also inverse omega passable.

Lemma 2.3

A lower triangular permutation can also be represented in the form $y_i = x_i \oplus g_i(x_1, \ldots, x_{i-1})$, and is thus inverse omega passable.

Proof:

Assume $y_i = x_i \oplus g_i(x_1, \ldots, x_{i-1})$ for i = 1, ..., k. Then

$y_{k+1} = x_{k+1} \oplus f_{k+1}(y_1, \ldots, y_k)$
$\quad = x_{k+1} \oplus f_{k+1}(x_1 \oplus c,\ x_2 \oplus g_2(x_1),\ \ldots,\ x_k \oplus g_k(x_1, \ldots, x_{k-1}))$,

which depends only on $x_1, \ldots, x_k$ apart from the leading $x_{k+1}$ term. Since the form holds for k=1, this Lemma is thus proved by induction.

Lemma 2.4

An upper triangular permutation can also be represented in the form $y_i = x_i \oplus g_i(y_{i+1}, \ldots, y_n)$, and is thus inverse omega passable.

Proof:

The proof is similar to that of Lemma 2.3.


The following two theorems are taken from [16], and will be presented without proofs.

Theorem 2.6 [Pease]

The set of admissible lower triangular permutations is a group under composition of maps.

Theorem 2.7 [Pease]

The set of admissible upper triangular permutations is a group under composition of maps.

Pease showed that shifts are lower triangular permutations in his context, which is upper triangular in our notation. We are going to show here that the odd-ordered vector unscrambling permutation is also upper triangular.

We first let the binary representation of the odd order, p, be $(p_1 p_2 \ldots p_{n-1} 1)$. If the source index is x, then the destination index will be y = p*x. The values of the $y_i$'s are then

$y_n = x_n$
$y_{n-1} = x_{n-1} \oplus p_{n-1} x_n$
$y_{n-2} = x_{n-2} \oplus p_{n-2} x_n \oplus p_{n-1} x_{n-1} \oplus s_{n-2}(x_{n-1}, x_n)$
...

where $s_i$ is the carry from the lower ordered bits to the ith bit.

It is immediately obvious that this is an upper triangular permutation as defined in Definition 2.3. Hence by Theorem 2.7, we can see that all compositions of shifts and odd-ordered unscrambling permutations are omega passable.
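A brute-force check of this claim for N = 16 can be sketched in Python. The helpers `p_order`, `shift` and `compose` are illustrative names, and passability is tested with the window condition of Lemma 2.2 (with m = 1 included as an added assumption).

```python
from itertools import combinations

def omega_passes(pairs, n):
    """Window test of Lemma 2.2 for an n x n omega network (n a power of 2)."""
    stages = n.bit_length() - 1
    for (s1, d1), (s2, d2) in combinations(pairs, 2):
        if s1 == s2:
            continue
        for k in range(stages):
            m = 1 << k
            if s1 % m == s2 % m and d1 // m == d2 // m:
                return False
    return True

N = 16

def p_order(p):
    """p-ordered unscrambling: x -> p*x mod N."""
    return [(x, (p * x) % N) for x in range(N)]

def shift(b):
    """Uniform shift: x -> x+b mod N."""
    return [(x, (x + b) % N) for x in range(N)]

def compose(first, second):
    """Apply `first`, then `second`."""
    second_map = dict(second)
    return [(x, second_map[y]) for x, y in first]

odd = range(1, N, 2)
all_p_orders_pass = all(omega_passes(p_order(p), N) for p in odd)
all_shifts_pass = all(omega_passes(shift(b), N) for b in range(N))
all_compositions_pass = all(
    omega_passes(compose(shift(b), p_order(p)), N)
    and omega_passes(compose(p_order(p), shift(b)), N)
    for p in odd for b in range(N))
print(all_p_orders_pass, all_shifts_pass, all_compositions_pass)
```

An even order such as p = 2 is not a permutation at all (0 and 8 both map to 0), and the same test rejects it.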

Defining $x = (x_1 x_2 \ldots x_n)$ and $y = (y_1 y_2 \ldots y_n)$, a permutation is linear if there exists an n x n nonsingular binary matrix, P, such that

$y = P \cdot x$. (2.19)

Extending Pease's result to our notation, a linear permutation is omega passable if P can be decomposed into the matrix product LU, where L is a lower unit triangular matrix and U is an upper unit triangular matrix. (Unit means that all coefficients on the main diagonal are ones.) By analogy, a linear permutation is inverse omega passable if P can be decomposed into UL.

One important result from [16] is that any nonsingular P can be decomposed into $L_1 U L_2$. This implies that P can be decomposed as $L_1 U$ and $L_2$. So P can be passed by two omega passes. It can also be decomposed as $L_1$ and $U L_2$. Hence it can be passed by two inverse omega passes. This is a very significant result because it increases the permutation ability of the omega network. The perfect shuffle permutation is a good example. Although the omega network is made up of stages of perfect shuffles, a perfect shuffle permutation cannot be passed by a fixed $\log_2 N$ stage omega network. However, it is a linear permutation. So it can be passed by two omega passes.
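For N = 4 the two-pass decomposition of the perfect shuffle can be written out explicitly. The Python sketch below is illustrative only: the matrices were worked out by hand for this example (they are not taken from [16]), arithmetic is over GF(2), bit 1 is the most significant bit as in the text, and `omega_passes` is the window test of Lemma 2.2 with m = 1 included as an added assumption.

```python
from itertools import combinations

def omega_passes(pairs, n):
    """Window test of Lemma 2.2 for an n x n omega network (n a power of 2)."""
    stages = n.bit_length() - 1
    for (s1, d1), (s2, d2) in combinations(pairs, 2):
        if s1 == s2:
            continue
        for k in range(stages):
            m = 1 << k
            if s1 % m == s2 % m and d1 // m == d2 // m:
                return False
    return True

def apply_matrix(rows, x, n_bits):
    """y = P.x over GF(2); the first row produces the most significant bit."""
    bits = [(x >> (n_bits - 1 - j)) & 1 for j in range(n_bits)]
    y = 0
    for row in rows:
        y = (y << 1) | (sum(r & b for r, b in zip(row, bits)) & 1)
    return y

def as_pairs(rows):
    return [(x, apply_matrix(rows, x, 2)) for x in range(4)]

SHUFFLE = [[0, 1], [1, 0]]   # y1 = x2, y2 = x1
L2 = [[1, 0], [1, 1]]        # first pass: lower unit triangular
L1U = [[1, 1], [1, 0]]       # second pass: L1.U, L1 = [[1,0],[1,1]], U = [[1,1],[0,1]]

shuffle = as_pairs(SHUFFLE)
first = as_pairs(L2)
second = dict(as_pairs(L1U))
two_pass = [(x, second[y]) for x, y in first]
print(shuffle, two_pass)
print(omega_passes(shuffle, 4), omega_passes(first, 4),
      omega_passes(as_pairs(L1U), 4))
```

Here the shuffle itself fails the single-pass test, each of the two factors passes it, and applying them in sequence reproduces the shuffle.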


A diagram showing the permutation abilities of the omega and inverse omega networks is shown in Figure 2.8.

Figure 2.8 Permutation Abilities of Omega Network and Inverse Omega Network

2.3 Batcher Network and the Shuffle Connection

One network that bears a great resemblance to the omega network is Batcher's merging network. The only structural difference is that instead of using tag bit comparison at each of the switching elements, the Batcher merger compares the magnitude of the two whole tag words at each switch to determine which of the two output ports to select.

Before we continue the discussion, we first define the order set of a set of N elements as the relative ordering of the elements. For example, the order set of (8,12,17,3,9) is (1,3,4,0,2).
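The order set is simply the zero-based rank of each element. A minimal Python sketch, with the function name `order_set` chosen here for illustration and distinct elements assumed:

```python
def order_set(values):
    """Rank of each element (0 = smallest) for a sequence of distinct values."""
    rank = {v: r for r, v in enumerate(sorted(values))}
    return [rank[v] for v in values]

print(order_set([8, 12, 17, 3, 9]))
```

For the sequence used in the text this gives the ranks 1, 3, 4, 0, 2.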

It can be observed that Batcher's merging algorithm for a set of distinct elements is equivalent to the omega tag routing algorithm for the corresponding order set.

It should, then, be obvious that the omega partition theorems also apply to the Batcher merger.

Now we can deduce an alternate proof of Stone's implementation [8] of the bitonic sorter on the perfect shuffle network.


The basic idea of an NxN Batcher bitonic sorter (N being a power of 2) is quite simple. Given two sorted sequences of length L each, if the first sequence is in ascending order while the second sequence is in descending order, then the 2Lx2L Batcher merger will sort the bitonic sequence formed by the juxtaposition of the two sequences. A bitonic sorter consists of $\log_2 N$ stages of merging networks, where stage i ($1 \le i \le \log_2 N$) consists of $N/2^i$ bitonic mergers of size $2^i \times 2^i$. (Some switching elements will need to have their outputs reversed in order to produce the required descending order.) By the extension of the omega partition theorem we can use an NxN bitonic merger to 'simulate' $N/2^i$ bitonic mergers of size $2^i \times 2^i$, by setting the switches in the first ($\log_2 N - i$) columns straight through. Hence, the sorting algorithm in [8] implements the Batcher bitonic sorter on a perfect shuffle network.
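The merge-and-sort structure described above can be illustrated with a recursive Python sketch. It shows only the comparison pattern of the bitonic sorter on lists whose length is a power of 2; it does not model the shuffle wiring or the switch settings of [8], and the function names are illustrative.

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence whose length is a power of 2."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    lo, hi = [], []
    # compare element i with element i + half, as one column of switches does
    for a, b in zip(seq[:half], seq[half:]):
        small, large = (a, b) if a <= b else (b, a)
        lo.append(small if ascending else large)
        hi.append(large if ascending else small)
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

def bitonic_sort(seq, ascending=True):
    """Sort by building an ascending and a descending half, then merging."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    first = bitonic_sort(seq[:half], True)     # ascending half
    second = bitonic_sort(seq[half:], False)   # descending half
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([8, 12, 17, 3, 9, 1, 5, 2]))
```

The descending recursive call plays the role of the switching elements whose outputs are reversed.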


2.4 Benes Network and the Shuffle Connection

It can be proved that the binary Benes network of size NxN (where N is a power of 2) is equivalent to a cascade of an inverse omega network and an omega network, with the middle two columns of switching elements collapsed into one. The best known control time for the Benes network [17] requires on the order of $N \log_2 N$ operations. However, the control time for an omega network is $O(\log_2 N)$, so for all omega passable or inverse omega passable permutations, the control time is only on the order of $\log_2 N$.

Since many connection networks are shuffle-based, if we build a one stage perfect shuffle network to interconnect all the processors, we can simulate any of these networks by cycling a sufficient number of times through the network. Moreover, the complexity of the network will only be O(N), which will be a great deal better than that of other networks.

One obvious shortcoming of the omega network is its inability to pass some permutations. In this section, we are searching for the best strategies to pass any permutation through a one stage recyclic perfect shuffle network. By recycling a one stage perfect shuffle network a sufficient number of times, we hope to be able to pass any general permutation.

Lang [13] proposed to use queues at the outputs of the switching elements, and then cycle the network as many times as is needed for each of the $\log_2 N$ steps. Following this strategy, the number of shuffle cycles required in the worst case is found to be

$2\sqrt{2N} - 3$ for $\log_2 N$ being odd,
$3\sqrt{N} - 3$ for $\log_2 N$ being even.

The length of the queues can grow to:

[expression not recoverable from the source] for $\log_2 N$ being odd,
[expression not recoverable from the source] for $\log_2 N$ being even.

Lang's algorithm is good in general. However, the building of two $O(\sqrt{N})$-long queues into each switching element certainly complicates the design of the switching elements. Hence, as another possible strategy, we propose routing algorithms that do not require queues at the switches. The routing strategy is similar to that used in the Destination Tag Method. Every input port will generate a $\log_2 N$ bit tag representation of its destination and push it (together with the data) through the network. The switching functions of the switching elements are still the same. However, in case of conflict at any switch, one input will be honored while the other has to be switched to an undesired output port and restart from the most significant tag bit. At any given stage, the bit positions to be examined for each tag may be different. So we need a bit count associated with each tag to indicate which bit position will have to be examined. After the last tag bit of each input has been examined, it will be stored away in a register and taken off the network.

The conflict resolution can be:

a) gate straight.

b) honor the input furthest away from its destination.

c) honor the input nearest to the destination.

Simulation using random permutations shows that, on the average, all three resolutions are just about equally effective.

Instead of restarting from the most significant tag bit for the data at the wrong output port, we can use a built-in table to determine which bit to examine. In the discussion below, we let n equal $\log_2 N$.

For  the  destination  tag  method,  assume  tnere  is  no 
conflict  at  any  stage.  We  can  observe  that  at  stage  k  (k=2 
to  n) #  the  data  word  whose  destination  tag  is  d,d2....dn  will 
be  at  switching  eleirent  i  (with  binary  representation 
i|i2»»««in-|)*  where  d,d2...d|^   =in-i<»»»»in-i  • 

Hence  by  comparing  the  destination  tags  with  the 
switch  tox  numbers,  wt;  can  find  out  how  far  a  data  word  with 
certain  destination  tag  is  frcir  its  destination.  A  more 
useful  information  is  which  tag  hit  d,^(1<K<n)  shall  we 
examine  at  that  particular  switch  for  further  switching. 
This  number  k  is  doteritined  as  the  maximum  nucrber  of  trailing 
digits  ot  the  switch  number  i  that  match  with  the  leading 
digits  of  the  specified  destination  tag.  These  k  values  can 
be  input  as  a  built-in  table  in  a  HC"^.  For  example,  for  an 
N=8  network,  the  table  will  be: 


                  Destination Tag (1st (log N - 1) bits)

                      00     01     10     11

             00        3      2      1      1
  Switch     01        1      3      2      2
  Number     10        2      2      3      1
             11        1      1      2      3
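The lookup rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `restart_bit_table` is hypothetical, switch numbers and tag prefixes are taken to be (n-1)-bit integers, and the entry is taken to be one more than the longest match between the trailing digits of the switch number and the leading digits of the tag.

```python
def restart_bit_table(n):
    """table[switch][tag_prefix] = index (1-based) of the tag bit to examine.

    Assumption: switch numbers and tag prefixes are (n-1)-bit integers,
    where n = log2(N) is the number of stages.
    """
    bits = n - 1
    size = 1 << bits
    table = []
    for switch in range(size):
        row = []
        for prefix in range(size):
            best = 0
            for t in range(1, bits + 1):
                trailing = switch & ((1 << t) - 1)  # last t digits of the switch number
                leading = prefix >> (bits - t)      # first t digits of the tag prefix
                if trailing == leading:
                    best = t                        # keep the longest match found
            row.append(best + 1)                    # examine the bit after the match
        table.append(row)
    return table


if __name__ == "__main__":
    for switch, row in enumerate(restart_bit_table(3)):
        print(format(switch, "02b"), row)
```

For n = 3 (N = 8) the rows for switches 00, 01 and 10 agree with the entries quoted in the text.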

Unfortunately, no theoretical bound has been found
for either of these two restart methods. However, the two
methods have been simulated for many random permutations and
various network sizes. The simulated averages are tabulated
below:


  Network size         8      16      32      64     128     256

  Complete Restart    4.2     7.2    11.6    17.0    24.5    32.7

  Table Lookup        3.9     6.5    10.0    14.7    20.2    27.9

Figure  2."  compares  these  two  restart  methods  with 
Lang's method. The table lookup method definitely has an
advantage over the complete restart method, but is also
slower than Lang's method.

3.        NETWORK UTILIZATION IN PARALLEL PROCESSING SYSTEMS

3.1  Introduction

To build a meaningful processing system, we have to
be able to handle efficiently most of the application demands
of the users. In this section, we will investigate the
alignment requirements of some common operations or
algorithms and with what efficiency they can be handled by
the alignment networks.

Array operations* are probably the most common type
of operations found in ordinary Fortran programs and they
have the most potential for high speedup and efficiency. So
the most important criterion of a good parallel processing
system is the efficient handling of array operations. Budnik
and Kuck [19] and Lawrie [4] discussed ways of organizing the
memories to allow conflict-free access to various slices of
arrays. Linear skewing is a standard technique. However,
the data output will sometimes form a p-ordered vector, which
cannot be unscrambled by means of a simple shifter. [4]
discussed the alignment requirements for some of the most
common types of array accessing.

* What we mean by array operations are the obvious type of
vector operations found in programs, not the type we obtain
by carefully rearranging the operation sequences of a
particular algorithm.

In ordinary programs, operations that are neither scalar
nor array operations very likely belong to the class of
recurrence operations. Recurrence operations, if not treated
properly, will degrade a parallel processing system to a
serial machine. Kogge and Stone [20], Heller [21], and Chen
and Kuck [22] have shown various algorithms to speed up
recurrence operations. Section 3.2 will discuss the
adaptation of various recurrence solving algorithms onto
parallel processing systems. It will be shown that, with
careful planning, the alignment requirements can be greatly
simplified. Hence, we would not need to use a full crossbar.
Instead, a simpler alignment network, such as an omega
network, will suffice.

The adaptation of recurrence operations onto parallel
processing actually serves as a good example of how a well
known computation algorithm can be tailored according to the
limited number of available permutations of the alignment
network to minimize alignment time. In the extreme cases,
the alignment network may have only a limited number of
connections (like the Illiac IV shifter or a one-stage
perfect shuffle network). To obtain any general permutation,
the network has to be recycled many times. For example, the
one-stage perfect shuffle network described in Section 2.5
may require $O(\sqrt{N})$ alignment steps before we can start on a
processing step. By carefully rearranging some of the
operation sequences in normal algorithms and by assigning
intermediate storage patterns in a deliberate fashion, we can
hopefully reduce the number of alignments per processing step
down to a constant (not dependent on N). Pease [10] and
Stone [8] showed how the Fast Fourier Transform can be done
efficiently on a multiprocessing system interconnected with
the perfect shuffle connection. In Section 3.3 we are going
to show how matrix multiplication can be done in a more
efficient way in a multiprocessing system with a certain
class of connection networks. The number of alignment steps
is shown to be reduced by a factor of $\sqrt{N}$ or $\log_2 N$.


The algorithms described in this section are in such
detail that they can be easily microprogrammed into the
respective parallel processing systems. The intermediate
storage and alignment patterns are all clearly specified.
Masks are needed occasionally to prohibit some processors
from doing the prescribed operations at some steps.
Throughout this section, we are considering parallel
processing systems structured like that in Figure 3.1. The
central control unit is not shown in the figure, but is
actually the master unit of the array of processors. It
sends the microinstructions to all the processors together
with the masks. Each processor will address only its own
memory. If the data words obtained need to be sent to
different processors, they will be gated to the Alignment
Send Register (ASR). After the required alignment is done,
they will be returned to the Alignment Receive Register (ARR).
With this architecture, we can align internal registers as
well as memory registers.

Figure 3.1   A Parallel Processor System Configuration

3.2  Adaptation of Recurrence Solvers

Chen and Kuck [22] provided many good algorithms to
handle recurrence systems. To actually implement these
algorithms on a parallel processing system, one would require
some careful partitioning of the recurrence system, and a
good, uniform way to allocate the initial and intermediate
data so as to minimize the data routing time and the amount
of intermediate storage space.

The solution of an R<n,m> recurrence system is
actually equivalent to the solution of a banded unit lower
triangular matrix system with matrix size n x n and the
number of nonzero bands = m+1. In general, we have to solve
(3.1) for x to get the recurrence results.

        $A x = f$                                        (3.1)

where A is a lower triangular matrix with 1's on the main
diagonal and m more nonzero subdiagonals.
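To make the equivalence concrete, here is a small serial sketch in Python. It is not one of the parallel algorithms of this section; the names `solve_recurrence` and `banded_matrix` and the coefficient layout `a[i][j-1]` (the coefficient of x[i-j] in row i) are assumptions made only for this illustration.

```python
def solve_recurrence(a, f, m):
    """Serial evaluation of x[i] = f[i] - sum_{j=1..m} a[i][j-1] * x[i-j].

    Terms with i - j < 0 are dropped (assumed zero initial values).
    """
    x = []
    for i in range(len(f)):
        acc = f[i]
        for j in range(1, min(m, i) + 1):
            acc -= a[i][j - 1] * x[i - j]
        x.append(acc)
    return x


def banded_matrix(a, n, m):
    """Build the n x n unit lower triangular matrix A with m subdiagonals."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
        for j in range(1, min(m, i) + 1):
            A[i][i - j] = a[i][j - 1]
    return A


if __name__ == "__main__":
    n, m = 6, 2
    a = [[(i + j) % 3 + 1 for j in range(m)] for i in range(n)]
    f = [5, 7, 11, 13, 17, 19]
    x = solve_recurrence(a, f, m)
    A = banded_matrix(a, n, m)
    # The recurrence solution satisfies A x = f.
    print([sum(A[i][k] * x[k] for k in range(n)) for i in range(n)] == f)
```

The serial loop is exactly forward substitution on A, which is why a recurrence left untreated forces n dependent steps.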


[23] and [24] reorganized some of the recurrence
algorithms into partitioned matrix notations to simplify
understanding. According to the number of processors
available and the values of m and n, we have to use different
recurrence solving algorithms for higher efficiency. In
general, there are three major algorithms to handle
recurrence systems. The first algorithm uses a limited
number of processors, and evolves from [23] and Algorithm 5
of [25]. The second algorithm assumes the presence of a
large number of processors, but will do the folding when the
number of available processors is less than the upper bound.
It evolves from [24] and Algorithm 2 of [22]. The third
algorithm is similar to the second algorithm except it uses a
less parallel method in solving the small full recurrence
systems in the initialization stage. The number of
processors used will be between that of the first and the
second algorithms.

Given p (the number of processors), and values of m
and n, the execution time for the first algorithm

   = $2mn(m+2)/p + \log_2(p/m)\,(m^2 + 3m/2 + 1) - m^2 - 9m/2 - 2$,   for $p \le n$,

   = $2m(m+2) + \log_2(n/m)\,(m^2 + 3m/2 + 1) - m^2 - 9m/2 - 2$,      for $p > n$.

The execution time for the second algorithm

   = $2m\log_2(n/m) + 4m$,                   for $p = mn$,
   = $(4m\log_2(n/m) + 6m)\,mn/(2p)$,        for $p < mn/2$,
   = $4m\log_2(n/m) + 6m$,                   for $p > mn$.

The execution time for the third algorithm

   = $\log_2 n\,(2 + \log_2 m) - \log_2 m\,(\log_2 m + 1)/2$,               for $p \ge m^2 n$,

   = $(\log_2 n\,(2 + \log_2 m) - \log_2 m\,(\log_2 m + 1)/2)\,m^2 n/p$,    for $p < m^2 n$.


We can decide which algorithm to use by comparing the
above times, given m, n, and p. A diagram showing what
algorithm to use for m=8 is shown in Figure 3.2. We can see
that if we have many processors, we would like to use the
second algorithm; if we have slightly fewer processors,
the third algorithm will be more efficient; and if we only
have a limited number of processors we would like to use the
first algorithm. A similar pattern can be observed for other
values of m.


a  choose the first algorithm.

b  choose the second algorithm.

c  choose the third algorithm.


Figure 3.2   Choice of Recurrence Algorithms for m=8

3.2.1  Using a Limited Number of Processors

The original algorithm can be found in [23]. The
following algorithm shows how it can be adapted to a parallel
processing system.


We will first describe how the input-output arrays
and the intermediate arrays are stored in the parallel memory
system. The nonzero elements of the A matrix, other than the
main diagonal, are stored in a section of the memory called
L. The processors and the memories are broken into k
partitions (numbered 0 through (k-1)) of size m, where
k=p/m. The jth off-diagonal element of the (i+1)th row of A
will be stored in the (m-j)th memory of the $\lfloor ik/n \rfloor$th
partition with relative address = i mod m. The f vector is
stored in the memory section called f. The f vector is
broken into k subvectors. The i-th element of the j-th
subvector will be stored in the (i mod m)th memory of the jth
partition with relative address = $\lfloor i/m \rfloor$. G and H are
intermediate storage sections. The result vector x is stored
in two memory sections, v and y. v and y together can
actually be the alias of memory section f, and the x vector
is stored in exactly the same way as the f vector. Section y
is the alias for the first (n/p-1) locations of section f and
v is the alias for the last location (relative address = n/p-1)
of f.
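The storage rule for the f vector can be written as a small address map. The sketch below is an illustration only: the function name `f_location` is hypothetical, indices are taken to be 0-based, and the k subvectors are assumed to be consecutive slices of length n/k.

```python
def f_location(g, n, p, m):
    """Map global index g of f to (partition, memory, relative_address).

    Assumptions: 0-based indices, k = p/m partitions of m memories each,
    and subvector j holds elements j*(n/k) .. (j+1)*(n/k) - 1 of f.
    """
    k = p // m                  # number of partitions
    sub_len = n // k            # elements per subvector
    j, i = divmod(g, sub_len)   # subvector j, element i within it
    return j, i % m, i // m


if __name__ == "__main__":
    n, p, m = 32, 8, 2
    for g in (0, 1, 2, 9, 31):
        print(g, f_location(g, n, p, m))
```

Under these assumptions each memory holds n/p elements of f, so the largest relative address is n/p-1, which is the location aliased by v.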




For  maximum  efficiency  of  this  algorithm,  p  should  be 
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less  than  or  equal  to  n. 

In the following algorithm,

   PPN   = processor identification number ($0 \le$ PPN $< p$),
   RA    = relative address within a memory section,
   ACC   = accumulators in the processors,
   R1-R3 = the three internal registers in the processors.

Algorithm:

Stage 1:

1.  Form k partitions of size m.*

2.  Set ppn = PPN mod m.

3.  Repeat for i=1 to n/k-1;

    a.  if ppn > (i mod m), then x = 0,
        else x = 1.
    b.  fetch from f, RA = $\lfloor i/m \rfloor$ + x.
    c.  left shift by (i mod m) into ACC.
    d.  fetch from L, RA = m-ppn+i-1.
    e.  perform a flip route in m-partition into R1.
    f.  fetch f element from memory (i mod m), RA = $\lfloor i/m \rfloor$.
    g.  broadcast to respective m-partition into R2.
    h.  multiply R1 and R2 into R3.
    i.  subtract R3 from ACC.
    j.  right shift ACC by (i mod m).
    k.  store ACC into f, RA = $\lfloor i/m \rfloor$ + x.

4.  Done.

*  The 'Form Partition' command forces all subsequent
instructions to conform to this partitioning, for both
alignment and memory accessing, unless specified otherwise.


Stage 2:

1.  Repeat for i = 0 to m-1;

    a.  fetch from L, RA = i, memories 0 through (m-i-1).
    b.  right shift by i.
    c.  store into G, RA = i.

2.  Form k partitions of size m.

3.  Repeat for i = 1 to n/k-1;

    a.  fetch from G, RA = i, into ACC.
    b.  repeat for j = 0 to m-1;
        (i)    set h = i + j - m.
        (ii)   if h<0 then set R1=0,
               else fetch from G, RA=h, into R1.
        (iii)  fetch L element from memory j, RA=i-1.
        (iv)   broadcast to respective m-partitions into R2.
        (v)    multiply R1 and R2 into R3.
        (vi)   subtract R3 from ACC.
    c.  store ACC into G, RA=i.

4.  Done.


Stage 3:

A.  Perform an m x m matrix transpose in m-partition from
    the last m rows of G's to H.
B.  Fetch from f, RA=(n/p-1), and store into v.
C.  Repeat for j = 0 to log(p/2m);

    1.  set r = (2**j)*m.

    2.  Form (p/2r) partitions of size 2r. Subdivide each
        partition into left and right halves.
        $v^{(j)}_{2i-1}$ and $H^{(j)}_{2i-2}$ will be in the left,
        and $v^{(j)}_{2i}$ and $H^{(j)}_{2i-1}$ will be in the right,
        $1 \le i \le (p/2r)$.
        ($H^{(j)}_0$ is nonexistent and filled with all 0's.)

    3.  Fetch from the right half v into right half ACC.

    4.  Put 0's into left half ACC.

    5.  Repeat for h=0 to m-2 by 2;

        a)  fetch from right half H, RA=h.
        b)  left shift by r into left half R1.
        c)  fetch from right half H, RA=h+1, into right half R1.
        d)  fetch from left half v, memories (r-m+h) and (r-m+h+1)
            only.
        e)  broadcast respectively to left and right half R2.
        f)  multiply R1 and R2 into R3.
        g)  subtract R3 from ACC.

    6.  Right shift ACC in left half by r into R1 of right
        half.

    7.  Add right half R1 to right half ACC.

    8.  Store right half ACC into right half v.

    9.  If j = log(p/2m) go to D.

    10. Repeat for h = 0 to m-2 by 2;

        a)  set ACC=0.
        b)  Repeat for q = 0 to m-1;
            (i)     fetch from right half H, RA=q.
            (ii)    fetch H element from left half memory
                    (r-m+q), RA=h.
            (iv)    broadcast into left R2.
            (v)     fetch H element from left half memory
                    (r-m+q), RA=h+1.
            (vi)    broadcast into right R2.
            (vii)   multiply R1 and R2 into R3.
            (viii)  add R3 to ACC.
        c)  negate ACC.
        d)  store left ACC into right half H, RA=h.
        e)  store right ACC into right half H, RA=h+1.
D.  Done.


Stage 4:

1.  Form k m-partitions.

2.  Repeat for h=0 to n/p-2;

    a.  Perform an m x m matrix transpose from G, RA: (hm) to
        (m(h+1)-1), to H.
    b.  Fetch from f, RA=h, into ACC.
    c.  Repeat for i=0 to m-1;
        (i)    fetch from H, RA=i, into R1.
        (ii)   fetch v element from memory i.
        (iii)  broadcast to the right neighbor m-partition into
               R2.
        (iv)   multiply R1 and R2 into R3.
        (v)    subtract R3 from ACC.
    d.  Store ACC into y, RA=h.

3.  Done.


Analysis : 


Throughout the algorithm, partitions of size m or
2r ($= 2^{j+1} m$) are used. By Corollary 2.1, we can see that the
omega network (or some of the full permutation networks) is
capable of performing the necessary alignments because of its
partition ability. As for the different kinds of alignment
patterns that are required within the partitions, we have
right and left shifts, flips and 1-to-many broadcasting. All
of these patterns can be passed by the omega network. One of
the noteworthy patterns can be found in step C.5.e of Stage
3. The broadcasting function has the form
{(k,x)<r,2> ---> (x,*)<2,r>}. That this function can be passed
by the omega network is proved in part (ii) of Section 2.2.1.2 of
this thesis. In step 2.c.(iii) of Stage 4, the connection
function can be passed by the omega network, by virtue of
Theorem  2.1,  after  setting  F^  to  a   1-shift   permutation   and 
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Pj^'s  to  1-to-many  broadcasting 


The m x m matrix transpose in step A of Stage 3 and
step 2.a of Stage 4 can be implemented as a 'subroutine' that
takes (m-1) steps.

Assume element (i,j) of matrix M is stored in memory
j with relative address i ($0 \le i,j < m$).

Let PPN be the processor identification number and
$\oplus$ be the bit by bit 'exclusive or' of two integers.


Matrix Transpose Routing:
DO k = 1 to m-1;

    h = PPN $\oplus$ k

    Fetch element with index h into R1

    Swap R1 with processor h*

    Store R1 into M, RA=h
END
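The routing loop above can be checked with a short simulation. The Python sketch below is an illustration under stated assumptions: the function name `transpose_by_xor_routing` is hypothetical, `memory[j][i]` stands for relative address i of memory j, m is assumed to be a power of two, and the pairwise swap is modelled as a simple exchange of the R1 registers.

```python
def transpose_by_xor_routing(memory):
    """Simulate the (m-1)-step transpose; memory[j][i] holds element (i, j)."""
    m = len(memory)
    assert m & (m - 1) == 0, "m is assumed to be a power of two"
    mem = [col[:] for col in memory]
    for k in range(1, m):
        # Each processor fetches the element at relative address PPN xor k.
        r1 = [mem[ppn][ppn ^ k] for ppn in range(m)]
        # Pairwise swap: processor ppn receives what processor (ppn xor k) fetched.
        r1 = [r1[ppn ^ k] for ppn in range(m)]
        # Store the received element back at the same relative address.
        for ppn in range(m):
            mem[ppn][ppn ^ k] = r1[ppn]
    return mem


if __name__ == "__main__":
    m = 4
    M = [[10 * i + j for j in range(m)] for i in range(m)]
    memory = [[M[i][j] for i in range(m)] for j in range(m)]
    for column in transpose_by_xor_routing(memory):
        print(column)
```

Because h = PPN xor k pairs the processors off symmetrically, every step is a pure exchange, and the diagonal elements never move.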


*  For omega networks, we can use the column control method
described in Section 2.1.4, with p set to h.

The detailed counts for various operation times in
this algorithm are listed below:

Stage 1:

    Fetch:       $3(n/k-1)$
    Store:       $(n/k-1)$
    Alignment:   $4(n/k-1)$
    Processor:   $2(n/k-1)$

Stage 2:

    Fetch:       $2mn/k + n/k - m - 1$
    Store:       $n/k + m - 1$
    Alignment:   $mn/k$
    Processor:   $2mn/k - 2m$

Stage 3:

    Fetch:       $\log_2(p/m)\,(3m^2/2 + 3m/2 + 1) - 3m^2/2 + m + 1$
    Store:       $\log_2(p/m)\,(m+1)$
    Alignment:   $\log_2(p/m)\,(3m^2/2 + 2m + 2) - 3m^2/2$
    Processor:   $\log_2(p/m)\,(m^2 + 3m/2 + 1) - m/2 - m^2$

Stage 4:

    Fetch:       $3mn/p - 3m + n/p - 1$
    Store:       $(n/p-1)(m+1)$
    Alignment:   $2mn/p - 2m$
    Processor:   $2mn/p - 2m$

As for the total, we get:

    Fetch:       $O(3m^2\log_2(p/m)/2 + 2m^2 n/p)$
    Store:       $O(m\log_2(p/m) + 3mn/p)$
    Alignment:   $O(3m^2\log_2(p/m)/2 + m^2 n/p)$
    Processor:   $O(m^2\log_2(p/m) + 2m^2 n/p)$

As we can see from these figures, the alignment and
memory times are of the same order as the processor time.
This implies that we are not spending excessive delay in
either the accessing or alignment of the intermediate data
elements.
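As a consistency check, the per-stage processor counts can be summed and compared with a closed form for the execution time of the first algorithm. The Python sketch below assumes the readings $2(n/k-1)$, $2mn/k-2m$, $\log_2(p/m)(m^2+3m/2+1)-m/2-m^2$ and $2mn/p-2m$ for the four stages, with k = p/m, and the closed form $2mn(m+2)/p + \log_2(p/m)(m^2+3m/2+1) - m^2 - 9m/2 - 2$; the function names are hypothetical and p/m is assumed to be a power of two.

```python
from fractions import Fraction


def processor_steps_by_stage(n, p, m):
    """Processor-step counts of Stages 1-4 (as read from the listing)."""
    k = Fraction(p, m)                    # number of m-partitions
    lg = (p // m).bit_length() - 1        # log2(p/m), p/m a power of two
    stage1 = 2 * (n / k - 1)
    stage2 = 2 * m * n / k - 2 * m
    stage3 = lg * (m * m + Fraction(3 * m, 2) + 1) - Fraction(m, 2) - m * m
    stage4 = Fraction(2 * m * n, p) - 2 * m
    return [stage1, stage2, stage3, stage4]


def first_algorithm_time(n, p, m):
    """Closed form for the first algorithm, assumed valid for p <= n."""
    lg = (p // m).bit_length() - 1
    return (Fraction(2 * m * n * (m + 2), p)
            + lg * (m * m + Fraction(3 * m, 2) + 1)
            - m * m - Fraction(9 * m, 2) - 2)


if __name__ == "__main__":
    for n, p, m in [(64, 16, 4), (1024, 64, 8)]:
        stages = processor_steps_by_stage(n, p, m)
        print(n, p, m, sum(stages), first_algorithm_time(n, p, m))
```

Exact rational arithmetic is used so that the two expressions can be compared without rounding.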


3.2.2  Full Recurrence System Solver
(using $p = n^3$ processors)

The algorithm we use here is derived from [23]. The
essence of the algorithm is as follows:
Assume we have to solve for x in

        $L x = f$                                        (3.2)

where L is a unit lower triangular matrix of size n x n,
while x and f are arrays of sizes n x h. The inverse of L
can be represented as

        $L^{-1} = M_{n-1} M_{n-2} \cdots M_1$                   (3.3)

where $M_i = (I - L_i e_i^T)$, in which $L_i$ is the ith column of L with
the element L(i,i) set to zero. The solution x is then given
by

        $x = M_{n-1} M_{n-2} \cdots M_1 f$.                     (3.4)

Then we will solve this product in parallel using
$\log_2 n$ stages.
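The product form and its evaluation as a balanced tree can be sketched with plain dense matrices. The Python code below is an illustration only, not the register-level algorithm of this section: the function names are hypothetical, columns are numbered from 0, an identity factor is appended as padding so that n factors are available, and n is assumed to be a power of two.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gauss_factor(L, i):
    """M_i = I - l_i e_i^T, where l_i is column i of L below the diagonal."""
    n = len(L)
    M = [[int(r == c) for c in range(n)] for r in range(n)]
    for r in range(i + 1, n):
        M[r][i] = -L[r][i]
    return M


def inverse_by_tree(L):
    """Multiply the factors pairwise, level by level; returns (L^-1, stages)."""
    n = len(L)
    factors = [gauss_factor(L, i) for i in range(n)]  # the last factor is the identity
    stages = 0
    while len(factors) > 1:
        # The higher-numbered factor always goes on the left of each pair.
        factors = [matmul(factors[2 * i + 1], factors[2 * i])
                   for i in range(len(factors) // 2)]
        stages += 1
    return factors[0], stages


if __name__ == "__main__":
    n = 8
    L = [[1 if r == c else ((3 * r + c) % 4 if c < r else 0) for c in range(n)]
         for r in range(n)]
    inv, stages = inverse_by_tree(L)
    print(stages, matmul(inv, L) == [[int(r == c) for c in range(n)] for r in range(n)])
```

Each level of the tree halves the number of factors, which is where the $\log_2 n$ stage count comes from.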

The initial storage patterns for the arrays in the $n^3$
processors/memories are shown in Figure 3.3.
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2 
For  example,  if  2n   =2,  then  the  addition   tree   will 

look  like: 


0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15


For the $M_i^{(j+1)}$ calculation, a pictorial description of
what needs to be done is shown in Figure 3.4.

Here, similar to what we do in the calculation of
$G^{(j+1)}$, we broadcast $M^{(j)}_{2i+1}$ elements to the right side R1's
and $T^{(j)}_{2i}$ elements to the right side R2's. The partial
results will then be in R3(1,0,i,*,x,y)<2,n/r,n/2r,r,r,n>.
The summation of partial products is done by shifts (of
$2^h \cdot r \cdot n$) and add, $0 \le h < j$. The multiplication and additions
will be done at the same time as those in the calculation of
$G^{(j+1)}$.


We first define S as the memory system of $n^3$ modules
and P as the processor system of $n^3$ modules. In this
algorithm, we need four internal processor registers (R1
through R4).


Figure 3.3   Initial Array Storage


$G^{(0)}$ consists of the h right hand columns and $M_i^{(0)}$ is
the ith column of the left hand matrix, L, ($1 \le i \le n-1$). Note
that $M_0^{(0)}$ is all zeros and the nth column of the matrix has
no entry.

Before we proceed, we would like to introduce some
new notations to simplify the description of the algorithms.


#N  the set {0, 1, 2, ..., N-1}

*   the set {..., -2, -1, 0, 1, 2, ...}

All vectors are declared using the notation A<Q> and
are indexed from A(0) to A(Q-1).

For calculation of $G^{(j+1)}$, we will first broadcast $M_1^{(j)}$
to the left half R1's. Then we broadcast $Y^{(j)}$ elements to the
left half R2's. Then the partial results of $G^{(j+1)}$ will be
in R3(0,0,*,x,y)<2,n/4r,r,2n,n>. The summation of partial
results is done by shifts (of $2^h \cdot 2n^2$) and add, $0 \le h < j$.


Figure 3.4   $M_i^{(j+1)}$ Calculation

The algorithm is shown below:


Algorithm:

A.  Repeat for j = 0 to ($\log_2 n$ - 2);

    1.  Let r=2**j, r'=max(r/4,1), r''=max(r/2,1), r=2r''.

    2.  Declare S(RA=G) as Q<2,n/4,2n,n>.
        Declare S(RA=M) as W1<2,n/2r',n/r,r',r,n>.
        Declare S(RA=M) as W2<2,n/2r'',n/2r,r'',2r,n>.

        Then for i=0 to (n/2r-1),

        $G^{(j)}$ <2n,n> is in Q(0,0,*,*),
        $Y^{(j)}$ <2n,r> is in Q(0,0,*,#r+(r-1)),
        $M^{(j)}_{2i}$ <r,n> is in W1(1,0,2i,0,*,*), $M^{(j)}_0$ is
            immaterial,
        $M^{(j)}_{2i+1}$ <r,n> is in W1(1,0,2i+1,0,*,*),
        $T^{(j)}_{2i}$ <r,r> is in W1(1,0,2i,0,*,#r+(2i+1)r-1),
        $M^{(j+1)}_i$ <2r,n> is in W2(1,0,i,0,*,*).

    3.  Declare P as P1<2,n/4r,r,2n,n>.
        Declare P also as P2<2,n/r,n/2r,r,r,n>.
        Then $G^{(j+1)}$ calculation uses P1(0,0,*,*,*),
        while $M^{(j+1)}_i$ calculation uses P2(1,0,i,*,*,*).

    4.  Fetch $M^{(j)}_1$(*,*) from W1(1,0,1,0,*,*).

    5.  Broadcast D(1,0,1,0,x,y)* to R1(0,0,x,*,y) of P1,
        ∀x,y.

        *  D is memory data register. Declaration always follows
        whatever is to be fetched or stored.

    6.  Fetch $M^{(j)}_{2i+1}$(*,*) from W1(1,0,2i+1,0,*,*).

    7.  Broadcast D(1,0,2i+1,0,x,y) to R1(1,0,i,x,*,y) of P2,
        ∀x,y.

    8.  Fetch $Y^{(j)}$(*,*) from Q(0,0,*,#r+(r-1)).

    9.  Broadcast D(0,0,x,y) to R2(0,0,y,x,*) of P1, ∀x, and
        ∀y∈{#r+(r-1)}.

    10. Fetch $T^{(j)}_{2i}$(*,*) from W1(1,0,2i,0,*,#r+(2i+1)r-1).

    11. Broadcast D(1,0,2i,0,x,y) to R2(1,0,i,y,x,*) of P2,
        ∀x, and ∀y∈{#r+(r-1)}.

    12. Multiply R1 and R2 into R3.

    13. Repeat for q=0 to (j-1);

        a)  Set R4=0.
        b)  Declare P as P3<2,$n^2$/(2**(q+2) r),2,$2^q$,r,n>.
        c)  Left shift R3(1,*,1,0,*,*) of P3 by $2^q rn$ into R4.
        d)  Declare P as P4<2,n/2**(q+3),2,$2^q$,2n,n>.
        e)  Left shift R3(0,*,1,0,*,*) of P4 by $2^q \cdot 2n^2$ into R4.
        f)  Add R3 and R4 into R3.

    14. Fetch $M^{(j)}_{2i}$(*,*) from W1(1,0,2i,0,*,*).

15.  Right  shift  D (l , 0,2i ,0,x,y)  by  (ir  n/2)  to  R2 
(1,0,i,x,y)  ,Vx,y. 

    16. Fetch $G^{(j)}$(*,*) from Q(0,0,*,*).

    17. Transfer** D(0,0,x,y) to R2(0,0,0,x,y) of P1, ∀x,y.

        *   This step will be skipped when j=0.
        **  Transfers need no alignment.

    18. Add R2 and R3 into R2.

    19. Transfer R2(0,0,0,x,y) of P1 to D(0,0,x,y) of Q.

    20. Store D(0,0,*,*) to $G^{(j+1)}$(*,*).

    21. Fetch $M^{(j)}_{2i+1}$(*,*) from W1(1,0,2i+1,0,*,*).

22.  Right  shift  D (1, 0,2i+1,0 ,x, y)  of  81  to  E 
(1,0,i,0,x*r,y)  of  H2  by  (ir^n/2-r^n/U* m)  V  K,y. 

    23. Transfer R2(1,0,i,0,x,y) of P2 to D(1,0,i,0,x,y) of
        W2, ∀x,y.

    24. Store D(1,0,i,0,x,y) into $M^{(j+1)}_i$(x,y), ∀x,y.


B.  For j = $\log_2 n$ - 1;

    1.  Let r=n/2.

    2.  Declare P as P5<n/2,2n,n>.

    3.  Declare S(RA=G) as Q1<n/2,2n,n>.
        Declare S(RA=M) as W3<2,8,n/8,n/2,n>.
        Then $G^{(j)}$ <2n,n/2> is in Q1(0,*,*),
        $Y^{(j)}$ <2n,n/2> is in Q1(0,*,#(n/2)+(n/2-1)),
        $M^{(j)}_1$ <n/2,n> is in W3(1,1,0,*,*).

    4.  Fetch $M^{(j)}_1$(*,*) from W3(1,1,0,*,*).

    5.  Broadcast D(1,1,0,x,y) to R1(x,*,y) of P5, ∀x,y.

    6.  Fetch $Y^{(j)}$(*,*) from Q1(0,*,#(n/2)+(n/2-1)).

    7.  Broadcast D(0,x,y) to R2(y,x,*) of P5, ∀x, and
        ∀y∈{#(n/2)+(n/2-1)}.

    8.  Multiply R1 and R2 into R3.

    9.  Repeat for q=0 to j-1;

        a)  Set R4=0.
        b)  Declare P as P6<n/2**(q+2),2,$2^q$,2n,n>.
        c)  Left shift R3(*,1,0,*,*) of P6 by $2^q \cdot 2n \cdot n$ into R4.
        d)  Add R3 and R4 into R3.

    10. Fetch $G^{(j)}$(*,*) from Q1(0,*,*).

    11. Transfer D(0,x,y) to R2(0,x,y) of P5, ∀x,y.

    12. Add R2 and R3 into R2.

    13. Transfer R2(0,x,y) of P5 to D(0,x,y) of Q1, ∀x,y.

    14. Store D(0,*,*) to $G^{(j+1)}$(*,*).


C.  Done.
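The declarations of the form A<d1,d2,...> used in the steps above can be read as mixed-radix views of the same linear array of processors or memories. The Python sketch below illustrates that reading; it assumes row-major order (last coordinate varying fastest), which is an interpretation suggested by the shift distances quoted in the algorithm rather than something stated explicitly, and the function names are hypothetical.

```python
from math import prod


def strides(dims):
    """Row-major strides: the last coordinate varies fastest (assumed)."""
    out, step = [], 1
    for d in reversed(dims):
        out.append(step)
        step *= d
    return out[::-1]


def linear_address(coords, dims):
    """Linear processor/memory number of a coordinate tuple."""
    return sum(c * s for c, s in zip(coords, strides(dims)))


if __name__ == "__main__":
    n, j, q = 16, 2, 1
    r = 2 ** j
    P3 = [2, n * n // (2 ** (q + 2) * r), 2, 2 ** q, r, n]
    P4 = [2, n // 2 ** (q + 3), 2, 2 ** q, 2 * n, n]
    # Both views cover n**3 processors.
    print(prod(P3), prod(P4), n ** 3)
    # Moving the third coordinate from 1 to 0 is a shift by its stride.
    print(strides(P3)[2], 2 ** q * r * n)
    print(strides(P4)[2], 2 ** q * 2 * n * n)
```

Under this reading, a shift that moves coordinate 1 to coordinate 0 in the third position of P3 or P4 is a uniform shift by that position's stride, so it can be issued as one alignment.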


Analysis; 


Steps A.5 and A.7 use a broadcasting function that is
omega passable. We first apply part (ii) of the broadcasting
theorems which shows that {(k,x,y)<2n,r,n> ---> (x,*,y)
<r,2n,n>} is omega passable. Then we can apply the omega
partition theorem to allow for the shift in partitions. The
broadcasting functions in Steps A.9 and A.11 are of the form
{(k,x,y)<n,r,n> ---> (y,x,*)<r,r,n>}. They are also omega
passable because of Part (iv) of the broadcasting theorem
(notice that $a \ge c$ since a=c=n), and the omega Partition
Theorem. Step A.13 is the repetitive shifts and adds
described earlier in this section. The broadcasting function
in Step B.5 is of the form {(k,x,y)<2n,n/2,n> --->
(x,*,y)<n/2,2n,n>} and that in Step B.7 is of the form
{(k,x,y)<n,2n,n> ---> (y,x,*)<n,2n,n>}. Both are passable by
the omega network.


The operation times for this algorithm are:
Fetch    :      $7\log_2 n - 4$
Store    :      $2\log_2 n - 1$
Align    !       (loq^n)    ♦lloggn+l 
Processor:      $\log_2 n\,(\log_2 n + 3)/2$

3.2.3  Using $m^2 n$ Processors

This algorithm is derived from [24]. We will solve
R<n,m> with $p = m^2 n$. However, if the number of available
processors is less than this, we will have to use folding.

The theoretical processor bound found in [22] is
$m(m+1)n/2 - m^3$. However, if m and n are powers of two, for p
to be a power of two also, we have to use $p = m(2m)n/2 = m^2 n$.


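The matrix L and the vector f can be written in the
form,

$$L = \begin{bmatrix} L_0 & & & \\ R_1 & L_1 & & \\ & \ddots & \ddots & \\ & & R_{n/m-1} & L_{n/m-1} \end{bmatrix}, \qquad f = \begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{n/m-1} \end{bmatrix}$$

where $L_i$ and $R_i$ are m x m unit lower triangular and upper
triangular matrices, respectively. Premultiplying both sides of
Lx = f by the matrix $D = \mathrm{diag}(L_i^{-1})$, we obtain the system
$L^{(0)} x = f^{(0)}$ where

$$L^{(0)} = \begin{bmatrix} I_m & & & \\ G_1^{(0)} & I_m & & \\ & \ddots & \ddots & \\ & & G_{n/m-1}^{(0)} & I_m \end{bmatrix}$$

and

$$G_i^{(0)} = L_i^{-1} R_i, \qquad f_i^{(0)} = L_i^{-1} f_i, \qquad i = 1, 2, \ldots, n/m-1.$$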


This will be called the initialization part of the
algorithm, for it sets up the data for the main part of the
algorithm.


Then for the main part, we form the sequence $L^{(j+1)}$
and $f^{(j+1)}$ for j = 0, 1, ..., log(n/m)-1. Each matrix $L^{(j)}$ is of
the form

$$L^{(j)} = \begin{bmatrix} I_r & & & \\ G_1^{(j)} & I_r & & \\ & \ddots & \ddots & \\ & & G_{n/r-1}^{(j)} & I_r \end{bmatrix}$$

where $r = 2^j m$. For the (j+1)th stage, we have

$$f_i^{(j+1)} = \begin{bmatrix} f_{2i}^{(j)} \\ -G_{2i+1}^{(j)} f_{2i}^{(j)} + f_{2i+1}^{(j)} \end{bmatrix}, \qquad i = 0, 1, \ldots, n/2r-1.$$


The initialization part will be done using the method
described in Section 3.2.2. We will not repeat the discussion
here. The second stage of the algorithm, however, needs some
discussion.
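The two parts of the method (initialization, then repeated doubling of the block size) can be sketched serially in Python. This is a dense-block illustration only, not the processor-level algorithm: the function names are hypothetical, blocks are numbered from 0, n/m is assumed to be a power of two, and the merge step uses the block elimination $x_{2i+1} - G_{2i+1}G_{2i}x_{2i-1} = f_{2i+1} - G_{2i+1}f_{2i}$, which follows from the block form of $L^{(j)}$ with identity diagonal blocks.

```python
def forward_solve(L, B):
    """Solve L X = B for a unit lower triangular L; B is a list of rows."""
    X = [row[:] for row in B]
    for i in range(len(L)):
        for k in range(i):
            for c in range(len(X[i])):
                X[i][c] -= L[i][k] * X[k][c]
    return X


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def block(L, r0, c0, m):
    return [row[c0:c0 + m] for row in L[r0:r0 + m]]


def solve_by_doubling(L, f, m):
    nb = len(L) // m
    # Initialization: G_i = L_i^{-1} R_i and f_i = L_i^{-1} f_i (G_0 does not exist).
    G, F = [None], []
    for i in range(nb):
        Li = block(L, i * m, i * m, m)
        F.append(forward_solve(Li, [[v] for v in f[i * m:(i + 1) * m]]))
        if i > 0:
            G.append(forward_solve(Li, block(L, i * m, (i - 1) * m, m)))
    r = m
    # Main part: merge neighbouring blocks until one block remains.
    while len(F) > 1:
        newG, newF = [None], []
        for i in range(len(F) // 2):
            Ge, Go = G[2 * i], G[2 * i + 1]
            prod_f = matmul(Go, F[2 * i])
            lower = [[a[0] - b[0]] for a, b in zip(F[2 * i + 1], prod_f)]
            newF.append(F[2 * i] + lower)
            if i > 0:
                prod_g = matmul(Go, Ge)
                top = [[0] * r + g for g in Ge]
                bottom = [[0] * r + [-v for v in row] for row in prod_g]
                newG.append(top + bottom)
        G, F, r = newG, newF, 2 * r
    return [row[0] for row in F[0]]


if __name__ == "__main__":
    n, m = 8, 2
    L = [[0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1
        for k in range(max(0, i - m), i):
            L[i][k] = (i + 2 * k) % 5 + 1
    f = list(range(1, n + 1))
    x = solve_by_doubling(L, f, m)
    print(x, all(sum(L[i][k] * x[k] for k in range(n)) == f[i] for i in range(n)))
```

In this dense sketch the merged G blocks carry explicit zero columns; only their last m columns are nonzero, which is the structure the parallel algorithm exploits.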




At step j+1 ($0 \le j < \log(n/m)$), we calculate $G^{(j)}_{2i+1} \cdot G^{(j)}_{2i}$
and $G^{(j)}_{2i+1} \cdot f^{(j)}_{2i}$ at the same time ($1 \le i < (n/2r)$). Each
calculation uses $rm^2$ processors. The G calculation is done
in the left $m^2 n/2$ portion of the p ($= m^2 n$) processors while the
f calculation is done in the right $m^2 n/2$ portion. $G^{(j)}_{2i+1}$ has
to be broadcast to both portions.

The memory system of $m^2 n$ is broken into
$rm^2/2$-partitions. $G^{(j)}_i$ ($0 \le i < n/r$) is stored in (0,i,0,*,*)
<2,n/r,m/2,r,m>, while $f^{(j)}_i$ ($0 \le i < n/2r$) is stored in
(1,i,0,*,0) <2,n/r,m/2,r,m>.

The f calculation will need one extra step to add
$f^{(j)}_{2i+1}$ to the product $G^{(j)}_{2i+1} \cdot f^{(j)}_{2i}$.


Algorithm:


Stage 1 (Initialization):


1.  If m=1, the system is already initialized; we can go
    directly to Stage 2.
2.  If m=2, there is only one off-diagonal element in each of
    the $L_i$'s, $0 \le i < n/2$. We can solve each of the n/2 systems
    in (0,i,*)<2,n/2,4> in two steps:
    $G_i^{(0)}(1,*) = G_i(1,*) - L_i(1,0) \cdot G_i(0,*)$, where $G_i = [R_i | f_i]$.
    $R_i$ will be in (0,i,*,*)<2,n/2,2,2> and $f_i$ at (1,i,*,0)
    <2,n/2,2,2> respectively.
    We then require two fetches and aligns to route them to
    $G_i^{(0)}$ at (0,i,0,*,*)<2,n/2,1,2,2> and $f_i^{(0)}$ at (1,i,0,*,0)
    <2,n/2,1,2,2> respectively.
3.  If m>2, we will use the method described in Section
    3.2.2.
    The $G_i$ calculation will be done in (0,i,*)<2,n/m,$m^3$/2>
    and the $f_i$ calculation will be done in (1,i,*)<2,n/m,$m^3$/2>
    for $0 \le i < n/m$.
    Initially, $R_i$ will be stored in column major order in
    (0,i,0,*,*)<2,n/m,m/2,m,m>. Resulting $G_i$ and $f_i$ will be
    where $R_i$ and $f_i$ were.
    For stage 2, we want $G_i^{(0)}$ to be in (0,i,0,*,*)
    <2,n/m,m/2,m,m> in row major fashion and $f_i^{(0)}$ to be in
    (1,i,0,*,0)<2,n/m,m/2,m,m>.
    Hence we want to route (0,i,0,x,y) to (0,i,0,y,x), and
    also (0,i,1,0,z) to (1,i,0,z,0) ∀x,y,z. Both routes are
    linear permutations and can be realized by the omega
    network in two passes, due to the results described in
    Section 2.2.4.


Stage 2:

A.  Repeat for j = 0 to log2(n/2m);

    1.  Let r = 2**j * m.

    2.  Declare S (S = G or f) as M<2,n/2r,m,r,m>.
        Then for $0 \le i < n/2r$,
        $G^{(j)}_{2i}$ <r,m> is in M(0,i,0,*,*),
        $G^{(j)}_{2i+1}$ <r,m> is in M(0,i,m/2,*,*),
        $f^{(j)}_{2i}$ <r> is in M(1,i,0,*,0),
        $f^{(j)}_{2i+1}$ <r> is in M(1,i,m/2,*,0),
        and the rest of M(*,*,*,*,*) = 0.

    3.  Declare R as R1<2,n/2r,m,r,m>.

    4.  Fetch $G^{(j)}_{2i+1}$(*,*) from M(0,i,m/2,*,*).

    5.  Broadcast D(0,i,m/2,x,y) to R1(*,i,y,x,*)
        of R1, for all x,y.

    6.  Fetch $G^{(j)}_{2i}$(#m+(r-m),*) and $f^{(j)}_{2i}$(#m+(r-m)) from
        M(*,i,0,#m+(r-m),*).

    7.  Broadcast D(z,i,0,x,y) to R2(z,i,x,*,y) of R1,
        for all y,z and all x in (#m+(r-m)).

    8.  Multiply R1 and R2 into R3.

    9.  Repeat for q = 0 to (log2(m) - 1);

        a)  Declare R as R2<nm/(2**(q+1) r),2,2**q,r,m>.

        b)  Left shift R3(*,1,0,*,*) of R2 by (2**q)rm into R4.

        c)  Add R3 and R4 into R3.

    10. Set R4=0.

    11. Fetch $f^{(j)}_{2i+1}$ <r> from M(1,i,m/2,*,0).

    12. Left shift D(1,i,m/2,*,0) into R4 by rm^2/2.

    13. Subtract R3 from R4 into R4.

    14. Right shift R4 by rm into D.

    15. Store D(*,i,1,*,*) into M(*,i,1,*,*).


B.  Done.


Analysis:

The broadcasting function in Step A.5 of Stage 2 is of the form
{(k,x,y)<m,r,m> -> (y,x,*)<m,r,m>} and is omega passable. The
broadcasting function in Step A.7 of Stage 2 is of the form
{(k,x,y)<m,r,m> -> (x,*,y)<m,r,m>} and is also omega passable.

The operation times for Stage 1 can be easily
obtained by substituting m for n in the operation times listed
in Section 3.2.2. However, we have to add four alignment
passes for the linear permutation functions described in the
last part of Stage 1 if an omega network is used. If a
crossbar (or any other full permutation network) is used, we
only need to add two passes.

The operation times for Stage 1 are then:
Fetch    :   $7\log_2 m - 4$
Store    :   $2\log_2 m - 1$
Align    :   $(\log_2 m)^2 + 4\log_2 m + 1$
Processor:   $\log_2 m\,(\log_2 m + 3)/2$

For Stage 2, the operation times are:
Fetch    :   $3\log_2(n/m)$
Store    :   $\log_2(n/m)$
Align    :   $\log_2(n/m) + \log_2(n/m)\,\log_2 m$
Processor:   $2\log_2(n/m) + \log_2(n/m)\,\log_2 m$

The total times for this algorithm are then:
Fetch    :   $3\log_2 n + 4\log_2 m - 4$
Store    :   $\log_2 n + \log_2 m - 1$
Align    :   $\log_2 n\,\log_2 m + \log_2 n + 3\log_2 m + 1$
Processor:   $\log_2 n\,\log_2 m - (\log_2 m)^2/2 + 2\log_2 n - \log_2 m/2$
3.2.4  Using a Moderate Number of Processors

This algorithm is derived from [24]. For the matrix
multiplication, however, this implementation does not use the
logsum method. Instead, it uses the more efficient
parallel-product serial-sum method. The preliminary
discussion of this algorithm will be similar to that of
Section 3.2.3.


For each stage (j+1), ($0 \le j < \log_2(n/m)$), we have n/2r-1
$G^{(j)}$'s. We allocate rm processors for each of these G's.
For j=0 (or r=m), we need a total of $(n/2m - 1)m^2$ processors.
Since we assume that n, m, p are all powers of two, we
have $p = (n/2m) \times m^2 = mn/2$.


In the subsequent description, we will assume p=mn/2. If,
in an actual case, p<mn/2, then we will apply folding to the
algorithm.


L =

  1
  (1,3)  1
  (2,2)  (2,3)  1
  (3,1)  (3,2)  (3,3)  1
         (4,0)  (4,1)  (4,2)  (4,3)
                (5,0)  (5,1)  (5,2)
                       (6,0)  (6,1)
In general, let (i,j) be the (m-j)th element on the
ith row of L, where $0 \le i < n$ and $0 \le j < m$. Then (i,j) will be
stored in memory ((i mod m)*m+j) of the $\lfloor i/2m \rfloor$th $m^2$-partition.
For the L matrix given above, the storage map of the first
$m^2$-partition is shown in Figure 3.6.


Memory    0      1      2      3      4     ...   14     15

L :     (0,0)  (0,1)  (0,2)  (0,3)  (1,0)   ...  (3,2)  (3,3)

        (4,0)  (4,1)  (4,2)  (4,3)  (5,0)   ...  (7,2)  (7,3)


Figure 3.6  Storage Map of the First $m^2$-Partition
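
The storage rule can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and both the grouping of 2m consecutive rows per partition and the relative address (0 for the first m rows of a partition, 1 for the next m) are assumptions read off Figure 3.6, not statements from the text.

```python
def storage_location(i, j, m):
    """Map band element (i, j) to (partition, module, relative_address).

    Assumptions (inferred from the storage map, not stated explicitly):
    2m consecutive rows share one m^2-partition, and the two groups of
    m rows inside a partition sit at relative addresses 0 and 1.
    """
    if not 0 <= j < m:
        raise ValueError("j must satisfy 0 <= j < m")
    partition = i // (2 * m)
    module = (i % m) * m + j          # memory module inside the partition
    relative_address = (i // m) % 2   # assumed row-group selector
    return partition, module, relative_address


if __name__ == "__main__":
    m = 4
    for i, j in [(0, 0), (1, 0), (3, 2), (4, 0), (7, 3)]:
        print((i, j), "->", storage_location(i, j, m))
```

With m=4 this places (3,2) in module 14 and (7,3) in module 15 of the first partition, which agrees with the two rows of the storage map.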


The ith element of f is stored in the same memory
module as (i,0).

For initialization, we will partition the system into
n/2m $m^2$-partitions. We will solve $G_i$ for i even first, then
$f_i$ for i even, then $G_i$ for i odd, and finally $f_i$ for i odd.
m multiplies and m additions will be required for each of the
calculations. Hence a total of 2m*4 = 8m steps will be
required for the initialization part.




Before we present the algorithm, we first define a
swap in k-partition as a k/2-end-around-shift in k-partition.
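
A minimal Python sketch of the swap may help fix the idea. The function names are hypothetical, and the data are modelled as a flat list whose length is a multiple of k; the shift direction is immaterial because a k/2 shift in a k-element partition is its own inverse.

```python
def end_around_shift(values, shift):
    """Circular left shift of a list by `shift` positions."""
    shift %= len(values)
    return values[shift:] + values[:shift]


def swap_in_partitions(values, k):
    """Apply a k/2 end-around shift inside each k-element partition."""
    if k % 2 != 0 or len(values) % k != 0:
        raise ValueError("k must be even and divide the array length")
    result = []
    for start in range(0, len(values), k):
        result.extend(end_around_shift(values[start:start + k], k // 2))
    return result


if __name__ == "__main__":
    data = list(range(8))
    print(swap_in_partitions(data, 4))   # halves of each 4-partition exchanged
```

Applying the swap twice restores the original order, which is why the same operation can move data in either direction between the two halves of a partition.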
Algorithm:

Stage 1 (Initialization):

A)  All arrays are declared as <n/2m,m,m>.

B)  Repeat for h=0,1;

    1.  Transfer L(i,j,x), RA=h, to L1(i,j,x), for all i,j,
        and all x such that $m-j \le x < m$.

    2.  Left shift by j in m-partition L(i,j,x), RA=h, to
        R(i,j,x), for all i,j, and all x such that $0 \le x < m-j$.

    3.  Repeat for q=0 to m-1;

        a.  Fetch R(i,q,x), for all i,x.

        b.  Broadcast D(i,q,x) to R1(i,*,x).

        c.  Fetch L1(i,j,m-j), for all i,j.

        d.  Broadcast D(i,j,m-j) to R2(i,j,*).

        e.  Multiply R1 and R2 into R3.

        f.  Fetch R(i,j,k), for all i,j,k.

        g.  Transfer D(i,j,k) to R2.

        h.  Add R2 and R3 into R3.

        i.  Transfer R3 to D.

        j.  Store D(i,j,k) into R(i,j,k), for all i,j,k.

    4.  Repeat for q=0 to m-1;

        a.  Fetch f(i,q,0), for all i, RA=h.

        b.  Broadcast D(i,q,0) to R1(i,*,0).

        c.  Fetch L1(i,j,m-j), for all i,j.

        d.  Left shift D(i,j,m-j) by (m-j) to R2(i,j,0).

        e.  Multiply R1 and R2 into R3.

        f.  Fetch f(i,j,0), for all i,j.

        g.  Transfer D(i,j,0) to R2.

        h.  Add R2 and R3 into R3.

        i.  Transfer R3 to D.

        j.  Store D(i,j,0) into f(i,j,0), for all i,j.


C)  Done.


Figure 3.5  Storage Map at Step j of Stage 2


Stage 2:


A)  Repeat for j=0 to log2(n/m)-1;

    1.  Set r = (2**j)m.

    2.  Declare all arrays as <n/2r,r,m>.

    3.  Fetch f(i,k,0), RA=1, for all i,k.

    4.  Transfer D(i,k,0) to ACC.

    5.  Repeat for q = 0 to m-1;

        a)  fetch R(i,k,q), RA=1, for all i,k.

        b)  left shift D by q to R1.

        c)  fetch f(i,r-m+q,0), RA=0, for all i.

        d)  broadcast D(i,r-m+q,0) to R2(i,*,0).

        e)  multiply R1 and R2 into R3.

        f)  subtract R3 from ACC.

    6.  Fetch f(i,j,0), RA=0, for all i,j.

    7.  Transfer D(i,j,0) to R1.

    8.  Transfer R1(i,j,0) to D, for all j and all i even.

    9.  Swap ACC(i,j,0) in 2rm-partitions to D, for all j and
        all i even.

    10. Store D to f, RA=0.

    11. Swap R1(i,j,0) in 2rm-partitions to D, for all j and
        all i odd.

    12. Transfer ACC(i,j,0) to D, for all i odd and all j.

    13. Store D to f, RA=1.

    14. If j = log2(n/m)-1, then go to B.

    15. Set ACC=0.

    16. Repeat for q = 0 to m-1;

        a)  fetch R(i,k,q), RA=1, for all i,k.

        b)  broadcast D(i,k,q) to R1(i,k,*).

        c)  fetch R(i,r-m+q,k), RA=0, for all i,k.

        d)  broadcast D(i,r-m+q,k) to R2(i,*,k).

        e)  multiply R1 and R2 into R3.

        f)  subtract R3 from ACC.

    17. Fetch R(i,j,k), RA=0, for all i,j,k.

    18. Transfer D(i,j,k) to R1.

    19. Transfer R1(i,j,k) to D, for all j,k and all i even.

    20. Swap ACC(i,j,k) in 2rm-partitions to D, for all j,k and
        all i even.

    21. Store D to R, RA=0.

    22. Swap R1(i,j,k) in 2rm-partitions to D, for all j,k and
        all i odd.

    23. Transfer ACC(i,j,k) to D, for all i odd and all j,k.

    24. Store D to R, RA=1.


B)  Done.
Analysis:

All of the alignment functions used in this algorithm
can be easily shown to be omega passable, by the simple
application of the Omega Partition Theorems. Steps 9, 11,
20, and 22 of Stage 2 show the swap operation described
earlier in this section.


The total times for this algorithm are:
Fetch    :   $\log_2(n/m)\,(4m+3)$
Store    :   $4\log_2(n/m) + 4m - 2$
Align    :   $\log_2(n/m)\,(4m+4) + 6m$
Processor:   $4m\,\log_2(n/m) + 6m$
3.3  Matrix Multiplication on a Parallel Processing System


A Fortran code section that performs matrix
multiplication is as follows:


      DO 10 I=1,N
      DO 10 J=1,N
      DO 10 K=1,N
10    A(I,J)=A(I,J)+B(I,K)*C(K,J)


An efficient way to perform the calculation would be
to compile the product by rows (parallel on J) as shown
below.

      DO 10 I=1,N
      DO 10 K=1,N
10    A(I,*)=A(I,*)+B(I,K)*C(K,*)

This algorithm will require $O(N^2)$ shifts to align the
operand matrices. A one-stage perfect shuffle network
simulating an omega network will take $\log_2 N$ steps per shift,
and the Illiac IV type of switch will take $O(\sqrt{N})$ steps per
shift on the average. So a total of $O(N^2 \log_2 N)$ or $O(N^2\sqrt{N})$
routing steps are required for matrix multiplication.
However, using the algorithm which follows, we need only
$O(N^2)$ steps.
We first need to define the following two notions:


Notation: If G is a permutation of some input set, $G^i$
implies i consecutive applications of the permutation G to
the input set.

Definition: A G-permutation is defined as a permutation G
such that $G, G^2, G^3, \ldots, G^N$ are distinct and form a group with
$G^N = I$, the identity permutation.

Every G-permutation can be uniquely represented as a
cycle $(i_0, i_1, \ldots, i_{N-1})$ where $G(i_0)=i_1$, $G(i_1)=i_2$, ..., $G(i_{N-1})=i_0$.
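
The single-cycle property can be tested mechanically. The Python sketch below is an illustration only (the function names are hypothetical): it treats a permutation as a list mapping i to perm[i] and checks that following it from element 0 visits all N elements before returning.

```python
from math import gcd


def is_g_permutation(perm):
    """True if perm (a list mapping i -> perm[i]) is one cycle of length N."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(n)):
        return False  # not a permutation of 0..N-1
    position, steps = perm[0], 1
    while position != 0:
        position = perm[position]
        steps += 1
    return steps == n


def shift_permutation(n, k):
    """The +k shift on n elements: i -> (i + k) mod n."""
    return [(i + k) % n for i in range(n)]


if __name__ == "__main__":
    n = 8
    for k in range(1, n):
        print(k, is_g_permutation(shift_permutation(n, k)), gcd(k, n) == 1)
```

For N=8 the two printed columns agree for every k: a shift by k passes the test exactly when k and N have no common factor.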

Two obvious G-permutations are the +1 shift
permutation and the -1 shift permutation. In general, +k
shift and -k shift permutations will be G-permutations if k
is relatively prime to N. Some nonshifting G-permutations
can be found using a perfect shuffle based permutation. These
G-permutations have a general form of:

$G(i) = [2i + b(i)] \bmod N$,

where $b(i) = \overline{b(i+N/2)}$ for all i = 0...N/2-1 (the two bits must
be complements, since otherwise i and i+N/2 would map to the
same element),

and $b(i) = 0$ or $1$ for all i.

A list of all {b(i), i=0...N/2-1} that will give
G-permutations for N=4 and 8 and the corresponding
G-permutations are listed in Table 3.1.
Size    b(i)        G-permutation

4       1 1         (0 1 3 2)

8       1 1 0 1     (0 1 3 7 6 5 2 4)

        1 0 1 1     (0 1 2 5 3 7 6 4)

Table 3.1
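
The entries of Table 3.1 can be reproduced by exhaustive search. The Python sketch below is illustrative, with hypothetical function names; it assumes that b(i+N/2) is the complement of b(i), which is what makes G(i) = [2i+b(i)] mod N one-to-one, and keeps the choices of b whose cycle through 0 covers all N elements.

```python
from itertools import product


def shuffle_permutation(first_half_bits):
    """Build G(i) = (2i + b(i)) mod N from b(0..N/2-1).

    Assumption: b(i + N/2) is the complement of b(i).
    """
    n = 2 * len(first_half_bits)
    bits = list(first_half_bits) + [1 - b for b in first_half_bits]
    return [(2 * i + bits[i]) % n for i in range(n)]


def cycle_from_zero(perm):
    """Follow the permutation from 0 until it returns to 0."""
    cycle, position = [0], perm[0]
    while position != 0:
        cycle.append(position)
        position = perm[position]
    return cycle


def g_permutations(n):
    """All (b, cycle) pairs for which the shuffle-based G is one N-cycle."""
    found = []
    for bits in product((0, 1), repeat=n // 2):
        cycle = cycle_from_zero(shuffle_permutation(bits))
        if len(cycle) == n:
            found.append((bits, cycle))
    return found


if __name__ == "__main__":
    for size in (4, 8):
        for bits, cycle in g_permutations(size):
            print(size, bits, cycle)
```

The search finds one choice of b for N=4 and two for N=8, matching the table.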


Assume we want to multiply two matrices A and B to
form C and that they are all of size N×N. The first method
uses N processors and requires that the storage scheme for
the matrices be 1-skew and 1-skip. The storage pattern is
shown in Figure 3.7. Each processor will have a
corresponding memory from which it can fetch data. Any data
a processor wants that is not in its own memory will have to be
routed from the other processors. This algorithm also
calculates the relative address (RA) for each array it
references.

Figure 3.7  1-skew 1-skip Storage Scheme


Each processor has a wired-in processor port number,
PPN ($0 \le$ PPN $\le N-1$). T is a temporary array.


Algorithm:

A)  Repeat for IC = 0 to N-1;

    1.  fetch A, RA=IC, into R1.

    2.  set IR = (PPN-IC) mod N.

    3.  repeat for IT = 0 to N-1;

        a.  fetch B, RA=IR, into R2.

        b.  multiply R1 and R2 into R3.

        c.  G-permute IR.

        d.  store T, RA=((PPN-IR) mod N), from R3.

        e.  G-permute R1.

    4.  set R1=0.

    5.  repeat for IT = 0 to N-1;

        a.  fetch T, RA=IR, into R2.

        b.  add R1 and R2 into R1.

        c.  G-permute R1.

        d.  G-permute IR.

    6.  store C (RA=IC) from R1.

B)  Done.


The significance of this result is that for certain
one stage networks, if there exists a G-permutation, then
each intermediate routing will take only O(1) time instead of
$O(\log_2 N)$ time or $O(\sqrt{N})$ time. This greatly reduces the
alignment time for the system.


There are many variations of this algorithm, each for
a different memory skewing scheme. Two of them can be used
for a parallel processing system with twice the number of
memory modules. They will be presented in Appendix A.

There is another algorithm which uses $N^2$ processors.
However, it works only for an $N^2 \times N^2$ network with ±N shift and
+1 shift connections. This algorithm takes a total of N+2
memory fetches and N+1 memory stores. The total number of
alignment requests is 3N and the total number of arithmetic
operations is 2N. Hence the alignment time matches in order
of magnitude with the memory and arithmetic operations.

The initial storage scheme is simple. All matrices
are stored in a linear manner, i.e., element (i,j) will be
stored in memory (Ni+j).

Algorithm:

A)  Fetch B into R1.

B)  Repeat for j = 0 to N-1;

    1.  store R1 into T, RA=j.

    2.  left shift R1 by N.

C)  Fetch A into R1.

D)  Set ACC=0.

E)  Repeat for j = 0 to N-1;

    1.  fetch T, RA=(PPN-i-j) mod N, into R2.

    2.  multiply R1 and R2 into R3.

    3.  add R3 to ACC.

    4.  right shift R1 by N into R2.

    5.  transfer R2(i,0)<N,N> to R1(i,0)<N,N>.

    6.  left shift R1 by 1.

F)  Store ACC into C.
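
Steps E.4 through E.6 together appear to rotate every row of the linearly stored matrix left by one position using only the N shift and the 1 shift. The Python sketch below checks this on a small case; it assumes the shifts are end-around over all N*N processors, and the function names are hypothetical.

```python
def end_around_left(values, shift):
    """Circular left shift over the whole processor array."""
    shift %= len(values)
    return values[shift:] + values[:shift]


def end_around_right(values, shift):
    """Circular right shift over the whole processor array."""
    return end_around_left(values, -shift)


def rotate_rows_left(r1, n):
    """Emulate steps E.4 to E.6 on an n*n array stored as element (i,j) at n*i+j."""
    r2 = end_around_right(r1, n)        # E.4: right shift R1 by N into R2
    r1 = list(r1)
    for i in range(n):                  # E.5: copy column 0 of R2 into R1
        r1[n * i] = r2[n * i]
    return end_around_left(r1, 1)       # E.6: left shift R1 by 1


if __name__ == "__main__":
    n = 3
    print(rotate_rows_left(list(range(n * n)), n))
```

For N=3 and contents 0..8 the result is rows (1,2,0), (4,5,3), (7,8,6): each row has been rotated left by one with three network operations, independent of N.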


The  above  two  algorithms  show  how  computations  can  be 
tailored  to  fit  a  simple  network  so  as  to  minimize  the 
routing  times. 
4.  PROCESSOR SYSTEM SIMULATION TECHNIQUES


4.1  Introduction


In order to evaluate the true effectiveness of a
parallel architecture, we must hypothesize a compiler
capable of compiling ordinary programs into code which most
effectively utilizes the architecture, especially the data
alignment capabilities. The resulting code could then be
simulated and the important performance measures determined.
This is the objective of our Analyzer/Simulator project. It
involves the simulation of program execution on some
proposed parallel processing systems. The front end of this
project is a program analyzer which accepts Fortran source
programs, and by detailed analysis of the control and data
dependencies it produces a highly parallelized version of
the original program (see [26]). Next, this parallelized
version is input to another program, the Resource Request
Generator (RRG), which attempts to compile the parallelized
program into simulatable code. The code is a set of machine
resource requests with data dependencies embedded in it. A
machine resource can be a scalar or array processor, an
alignment network, or the whole bank of array memories. The
task of the RRG is to decide on the best way to slice the
computation specified by each instruction node, based on the
size of the matrices, the number of available processors,
the matrix storage scheme, and the type of alignment
network. Finally, the output of the RRG is input to a
simulator capable of simulating a wide variety of
architectures. Here the time required, utilization of
various resources, and speedup and efficiency of the
program's execution in the given parallel processing system
will be calculated. Machine organization parameters can be
specified by the user. These parameters include the storage
scheme, the alignment network, the processor and memory
speeds, the number of array memories, and the number of
processors in the array processor system.


A block diagram showing the general organization of
the software is shown in Figure 4.1. The Program Analyzer
is described elsewhere [26] and we will not discuss it here.
In this section we will describe the RRG and machine
simulation. Some experimental results will also be
presented. In Section 4.2 we will discuss the input data
structure and available machine parameters for the RRG. In
Section 4.3 we will describe the output of the simulator in
the form of performance measures. Then in Sections 4.4
through 4.6, some of the algorithms and strategies of the
RRG will be described. Finally, in Section 4.7, we will
discuss some of the preliminary results of the initial set
of experiments.


[Figure 4.1: block diagram of the software organization]

4.2  Simulator Input Specifications

4.2.1  Input Instruction Nodes

The most easily recognizable form of parallelism is
typified by a matrix addition shown below.

      DO 10 I=1,N
      DO 10 J=1,M
10    A(I,J)=B(I,J)+C(I,J)


The  Program  Analyzer  will  determine  what  the 
dependency  limitations  are  for  each  program  segment  (in  this 
case,  there  are  none),  and  then  break  them  into 
machine-code-like  instruction  nodes.  Each  instruction  node 
will  provide  all  the  information  concerning  the  operator, 
the  two  operands  and  the  result. 

After the Fortran Analyzer phase, all parallel DO
loop indices are distributed into each instruction node.
The DO loop limits are normalized to start with 0 and have
increments of 1 only. We first assume that there are n
active DO loop indices in a particular instruction. The
j-th DO loop index, $I_j$, may have an upper limit, $U_j$, as a
function of $I_1, I_2, \ldots, I_{j-1}$. Assuming the function is a
linear function, the $U_j$'s can be represented by an n×(n+1)
matrix, D, such that

$$[U_1, U_2, \ldots, U_n]^T = D\,[I_1, I_2, \ldots, I_n, 1]^T$$

Note that except for the last column, the D matrix
is strictly lower triangular.

In a similar fashion, each k-dimensional array (with
linear subscripts) being referenced in a node with n active
DO loop indices will have a corresponding k by (n+1)
coefficient matrix C. Let $E_j$ be the subscript expression of
the jth dimension, then

$$[E_1, E_2, \ldots, E_k]^T = C\,[I_1, I_2, \ldots, I_n, 1]^T$$
Definition: Let there be M memory units. A p-ordered
N-vector (mod M) is defined as a vector of N elements whose
i-th logical element is stored in memory unit pi+c (mod M)
where c is an arbitrary constant.

The idea of a p-ordered N-vector is very useful in
finding the number of cycles required to access a vector or
to align it using certain alignment networks.
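
As an illustration of why the order matters, the Python sketch below counts memory access cycles for a p-ordered vector. It rests on a simplifying assumption that is not taken from the text: each memory unit can deliver one element per cycle, so the cycle count is the largest number of elements falling in any one unit. The function names are hypothetical.

```python
from collections import Counter


def memory_units(p, c, n_elements, m_units):
    """Memory unit holding each logical element of a p-ordered vector (mod M)."""
    return [(p * i + c) % m_units for i in range(n_elements)]


def access_cycles(p, c, n_elements, m_units):
    """Cycles to fetch the vector, assuming one element per unit per cycle."""
    counts = Counter(memory_units(p, c, n_elements, m_units))
    return max(counts.values()) if counts else 0


if __name__ == "__main__":
    for p in (1, 2, 3, 4):
        print("p =", p, "cycles =", access_cycles(p, 0, 8, 8))
```

With M=8 and N=8, orders 1 and 3 are conflict free, while orders 2 and 4 pile the elements onto four and two units respectively.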

Using a generalized skewing scheme as in [Lawrie 1],
for an array with k dimensions, we will have $(m_1, m_2, \ldots, m_k)$
skewing. Assuming an array operand in an instruction node
has k dimensions and n active indices, then we define an
order vector, V, of n+1 elements as:

$$V = [m_1, m_2, \ldots, m_k]\, C$$

For an array element defined by any particular set
of values $\{I_1=h_1, I_2=h_2, \ldots, I_n=h_n\}$, the element will be stored
in memory port z, where

$$z = V\,[h_1, h_2, \ldots, h_n, 1]^T \pmod M$$

In addition, the importance of the order vector lies
in the fact that for any partition of the array formed by
running the jth active index parallel, the partition is a
$V_j$-ordered vector (mod M), where M is the number of
memories. When the order and number of elements of a
partition are calculated, the number of cycles required to
access and align the vector can be easily determined.
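
A small Python sketch of the order vector computation follows. It is an illustration under stated assumptions: the order vector is taken to be the skewing vector times the coefficient matrix C, the memory port is reduced mod M, and the function names are hypothetical.

```python
def order_vector(skew, c_matrix):
    """V = [m_1 ... m_k] C, a vector of n+1 elements (assumed form)."""
    columns = len(c_matrix[0])
    return [sum(skew[d] * c_matrix[d][col] for d in range(len(skew)))
            for col in range(columns)]


def memory_port(skew, c_matrix, index_values, m_units):
    """Memory port of the element selected by index values h_1..h_n."""
    v = order_vector(skew, c_matrix)
    extended = list(index_values) + [1]   # append the constant term
    return sum(vi * hi for vi, hi in zip(v, extended)) % m_units


if __name__ == "__main__":
    c = [[1, 0, 0],      # first subscript is I
         [0, 1, 0]]      # second subscript is J
    print(order_vector((1, 1), c))            # order of each index
    print(memory_port((1, 1), c, (2, 3), 8))  # port of A(2,3) with 8 memories
```

For A(I,J) under (1,1) skewing the order vector is [1, 1, 0]: sweeping either index alone gives a 1-ordered vector, which is why rows and columns are both conflict free under this scheme.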

Fortran statements that cannot be easily
dispatched as array or scalar operations will be
grouped as recurrence nodes. Each node represents an R<n,m>
system (cf. [22]). Each R<n,m> system will be broken into
as many smaller recurrence units as possible. Information
such as the number of smaller recurrence units and the
values of n and m for the units can be found in a recurrence
node. With this information, we can determine which is the
best recurrence solving algorithm to use and its
corresponding execution time.
4.2.2  Machine Parameters


In the parallel architecture that we simulate we
assume that the resources can operate in an overlapping (or
pipelining) fashion. However, we still honor the dependency
between different instruction nodes. Each resource will
have its own resource queue to hold the waiting requests.
Hence one node may be using the alignment network while an
independent node can start fetching its operands from the
memory system.

It is impossible to simulate every known parallel
architecture. So we concentrate on two classes of
architectures. The first class is shown in Figure 4.2 and
the second in Figure 4.3. Note that the one in Figure 4.2
resembles that assumed in Chapter 3. The second type has
two alignment networks, one for input to the processing
system and the other for output to the memory system. This
class can be chosen by setting the parameter option
M_PARAM.TWO_AL_NET to 1.

The scalar memory and scalar processor in Figures 4.2
and 4.3 are optional and can be chosen by setting M_PARAM.SM
and/or M_PARAM.SP to 1's.


The number of processing elements in the processing
array and the number of memories can be selected using the
parameters M_PARAM.NUM_PROC and M_PARAM.NUM_MEM.

Figure 4.2  Machine Configuration A

Figure 4.3  Machine Configuration B

The skewing system chosen can be specified by
assigning values to the array M_PARAM.MEM_MAP. For example,
to get (1,1) skewing, we would put the numbers 0,0,0,1,1
into the MEM_MAP array.

As for the alignment network, right now we can
choose any one of the four possible networks by setting
M_PARAM.AN_TYPE to the appropriate value.


AN_TYPE   = 1     crossbar
          = 2     omega network
          = 3     ±√P, ±1 shift network
          = 4     ±1 shift network

The memory cycle time can be specified by using
M_PARAM.MCYCLE_TIME, and the scalar memory time can be specified
using M_PARAM.S_MEM_TIME. To allow for pipelining, we have
two separate time fields for each resource request. The
first is RT, which contains the time that must elapse before
another request for the same resource can be started. The
other is IT, which contains the time required to finish
processing the request. If a particular resource is
pipelined, then RT will be the pipeline segment time and IT
will be the length of the whole pipe. In this case, IT is
greater than or equal to RT. For a memory request, however,
RT will be the cycle time and IT will be the access time.
In this case, IT is less than or equal to RT.
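
The roles of RT and IT can be illustrated with a small single-resource queue model in Python. This is a sketch under simple assumptions (requests are served in arrival order, and the function name is hypothetical); it is not the Simulator's actual scheduling code.

```python
def schedule(arrival_times, rt, it):
    """Return (start, finish) pairs for requests served in order by one resource.

    rt: time that must elapse before the next request may start.
    it: time needed to finish processing one request.
    """
    results = []
    next_free = 0
    for arrival in arrival_times:
        start = max(arrival, next_free)
        results.append((start, start + it))
        next_free = start + rt
    return results


if __name__ == "__main__":
    # Pipelined resource: segment time 1, pipe length 4 (IT >= RT).
    print(schedule([0, 0, 0], rt=1, it=4))
    # Memory: cycle time 3, access time 2 (IT <= RT).
    print(schedule([0, 0, 0], rt=3, it=2))
```

In the pipelined case three requests overlap and finish at times 4, 5 and 6; in the memory case each result is available after 2 time units but the next access cannot begin until the 3-unit cycle has ended.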
The alignment times are specified by M_PARAM.A_CT_IN
and M_PARAM.A_CT_OUT. The processing times can be assigned
by the user using the array M_PARAM.OP_TIME. The elements
are times required for simple assignment, addition,
subtraction, multiplication, and division, respectively. We
also allow the users to define their own built-in function
and user defined function times in M_PARAM.BUILTIN_TIME and
M_PARAM.USERFCN_TIME respectively.

A sweeping index is defined to be the active index
that is to be run parallel in order to produce the desired
partition. One option that the user has is to declare what
he wants as the sweeping index. The other is to let the
Simulator choose the best index in terms of execution times.
To choose the first option, we will have to set
M_PARAM.SWPOPT to 0 and to set the array M_PARAM.SWEEP_INDX
to the running indices required.


For standard algorithmic procedures, such as
recurrence handling, the operation times for various
resources have been calculated in Chapter 3. Thus we just
need to substitute in the formulas for the operation times
rather than perform detailed simulation of the algorithm.
However, we will be missing certain overlap parallelism.
This overlap can be assigned by the user using
M_PARAM.OVERLAP. In appropriate cases, IT will be set to
OVERLAP*RT/100, and will be less than RT.

In this section, we have discussed various options
that are available to the users. By setting all the
appropriate options, the user has defined a machine
configuration that he wants to study. In the next section,
we will discuss the outputs available from this Simulator.
4.3  Simulator Outputs

The output of the simulator is a set of performance
measures. One such measure is $T_p$, the time required for
simulated execution of the program graph from the Program
Analyzer in the specified machine organization using p
processors. If $T_1$ is the execution time for the same
program graph using a single processor, then we define another
measure, the speedup factor, $S_p$, as $T_1/T_p$. In addition, the
Simulator calculates measures of the utilizations of various
system resources. The utilization of each resource is broken
down into several separate utilizations, $U_a$, $U_s$ and $U_{IF}$.
First, $U_a$, the array duty cycle, is the percentage of time
that at least one processor is performing a computation.
However, whenever an array operation is being performed, only
some of the processors may be actually doing useful work.
This is measured by the slicing utilization, $U_s$. For example,
to add two 30 element vectors together using 20 processors
would require two steps. The first step would form the
first 20 sums and would use all 20 processors, resulting in a
slicing utilization, $U_s$, of 100%. The second step would
form the last 10 sums using only 10 processors and would
result in $U_s$=50%. The overall $U_s$ would then be 75%.
Finally, some processors are turned off because of IF
statements in the original programs, and this is measured by
$U_{IF}$. For example, assume that in the following program, 1/3
of the B(I) are less than zero:

      DO 10 I=1,30
10    IF (B(I).GE.0) A(I)=A(I)+B(I)


Then $U_{IF}$=67%. Thus, using 20 processors on this program, $U_a$
might be 80%, for example, because the processors are
waiting for memory access or data alignment. Of this 80% of
the time, only 75% of the processors could be used because
of the difference between the number of processors and the
array size ($U_s$=75%), and of these 75% of the processors,
only 67% are turned on ($U_{IF}$=67%). Thus, the total average
processor duty cycle, $U_T$, is equal to $U_a \cdot U_s \cdot U_{IF}$ =
80%*75%*67% = 40%. By separating the components of
processor utilization in this way we can determine the
source of processor inefficiencies.
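
The arithmetic of the example can be reproduced with a short Python sketch. The function names are hypothetical, and the slicing utilization is computed as useful processor-steps divided by available processor-steps, which is one reading of the averaging used in the example.

```python
from math import ceil


def slicing_utilization(vector_length, processors):
    """Fraction of processor-steps doing useful work when slicing a vector."""
    steps = ceil(vector_length / processors)
    return vector_length / (steps * processors)


def total_duty_cycle(u_a, u_s, u_if):
    """Product of array duty cycle, slicing utilization and IF utilization."""
    return u_a * u_s * u_if


if __name__ == "__main__":
    u_s = slicing_utilization(30, 20)
    print(u_s)                                # two steps: 30 of 40 slots used
    print(total_duty_cycle(0.80, u_s, 0.67))  # roughly 0.40
```

The 30-element vector on 20 processors yields 0.75, and the product with the other two factors rounds to the 40% quoted above.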
4.4  Sweeping Indices

As described in Section 4.2.2, when an instruction
node can be swept by more than one index, there are two
options for the user to define the sweeping index. One is
to specify it in the parameter M_PARAM.SWEEP_INDX. The
other is to let the RRG choose the best index for each
individual node. When there are other indices having upper
limits that depend on the sweeping index, we need to modify
the C and D matrices before the sweeping is allowed. In
other words, an instruction node can be swept on an index $I_i$
if and only if, in the D matrix, the i-th column has all
zeroes.

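For example, if the instruction node looks like:

      DO I = 0,N-1
        DO J = 0,I+k
          A(I,J) = 0

then

\[
C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\quad\text{and}\quad
D = \begin{bmatrix} 0 & 0 & N-1 \\ 1 & 0 & k \end{bmatrix}
\]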

To sweep on index I, we need to expand the node to
two nodes:

      DO J = 0,k-1
        DO I = 0,N-1
          A(I,J) = 0

and

      DO J = 0,N-1
        DO I = 0,N-J-1
          A(I+J,J+k) = 0




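Now there are two sets of C and D matrices. They are:

\[
C_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix},
\qquad
D_1 = \begin{bmatrix} 0 & 0 & N-1 \\ 0 & 0 & k-1 \end{bmatrix}
\]

\[
C_2 = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & k \end{bmatrix},
\qquad
D_2 = \begin{bmatrix} 0 & -1 & N-1 \\ 0 & 0 & N-1 \end{bmatrix}
\]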

Note that after this transformation, the first
columns of $D_1$ and $D_2$ are all zeroes. Hence the two
transformed nodes can be swept on index I.
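
As a check on this expansion, the following sketch enumerates the array elements touched by the original node and by the two transformed nodes and confirms that they cover the same elements exactly once. The function names are hypothetical, and the values of N and k are arbitrary small test values.

```python
def original_node(n, k):
    """Elements set by: DO I=0,N-1 / DO J=0,I+k / A(I,J)=0."""
    return {(i, j) for i in range(n) for j in range(i + k + 1)}

def split_nodes(n, k):
    """Elements set by the two transformed nodes, each sweepable on I."""
    # First node:  DO J=0,k-1 / DO I=0,N-1 / A(I,J)=0
    first = {(i, j) for j in range(k) for i in range(n)}
    # Second node: DO J=0,N-1 / DO I=0,N-J-1 / A(I+J,J+k)=0
    second = {(i + j, j + k) for j in range(n) for i in range(n - j)}
    return first, second

first, second = split_nodes(6, 3)
same_elements = (first | second) == original_node(6, 3)
no_overlap = not (first & second)
```

Both `same_elements` and `no_overlap` are true, so the two nodes together perform the same assignments as the original node.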

In general, given a node

      DO I=0,N-1
        DO J=0,hI+k
          A(I,J)=0

and we want to sweep on index I, the original loop will have
to be transformed into:

      DO J=0,hN-h+k
        DO I=0,f(J)
          A(I,J)=0

The first problem here is what should be the
equation for f(J) in general. If h is not equal to 1, f(J)
can contain many modulo functions, which are nonlinear, and
cannot be represented easily in our linear D matrices.
Another problem is that if h is large, many of the vectors
A(*,J) will be small vectors, which can seriously degrade the
efficiency of a parallel system. So the solution we
picked is to do this kind of transformation only if h=1.


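Consider a more general case than the one shown
above:

      DO I=0,N-1
        DO J=0,I+k
          A(pI+qJ+r,xI+yJ+z)=0

i.e.,

\[
C = \begin{bmatrix} p & q & r \\ x & y & z \end{bmatrix}
\quad\text{and}\quad
D = \begin{bmatrix} 0 & 0 & N-1 \\ 1 & 0 & k \end{bmatrix}
\]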

We can split the node into two parts:

      DO I=0,N-1
        DO J=0,k-1
          A(pI+qJ+r,xI+yJ+z)=0

and

      DO I=0,N-1
        DO J=0,I
          A(pI+qJ+r+qk,xI+yJ+z+yk)=0

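The first part thus has:

\[
C_1 = \begin{bmatrix} p & q & r \\ x & y & z \end{bmatrix} = C
\quad\text{and}\quad
D_1 = \begin{bmatrix} 0 & 0 & N-1 \\ 0 & 0 & k-1 \end{bmatrix}
\]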


The second part is equivalent to:

      DO J=0,N-1
        DO I=J,N-1
          A(pI+qJ+r+qk,xI+yJ+z+yk)=0

and is also equivalent to:

      DO J=0,N-1
        DO I=0,N-J-1
          A(pI+(p+q)J+r+qk,xI+(x+y)J+z+yk)=0

or

\[
C_2 = \begin{bmatrix} p & p+q & r+qk \\ x & x+y & z+yk \end{bmatrix},
\qquad
D_2 = \begin{bmatrix} 0 & -1 & N-1 \\ 0 & 0 & N-1 \end{bmatrix}
\]

Hence $C_2 = C * X$, where

\[
X = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & k \\ 0 & 0 & 1 \end{bmatrix}
\]

Therefore, $V_1 = M * C_1$ and $V_2 = M * C_2$ will form a
pair of order vectors that can determine if the matrix can
be swept by the index I.

In general, if we reduce $I_h$ from $I_j$-dependent to
$I_j$-independent, X will be an (n+1)x(n+1) matrix with ones on
the diagonal, another 1 at position (j,h) and a k at position
(h,n+1).
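
The relation $C_2 = C * X$ can be checked numerically. The sketch below builds X from the rule just stated, using 1-based positions, for the two-index example where J (index 2) depends on I (index 1). The helper names and the numeric values chosen for p, q, r, x, y, z and k are assumptions made only for this illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def sweep_transform(n_indices, j, h, k):
    """(n+1)x(n+1) identity with an extra 1 at (j,h) and k at (h,n+1)."""
    size = n_indices + 1
    x = [[1 if r == c else 0 for c in range(size)] for r in range(size)]
    x[j - 1][h - 1] = 1        # positions are 1-based in the text
    x[h - 1][size - 1] = k
    return x

# Arbitrary test values for the subscript coefficients and the offset k.
p, q, r, x_, y, z, k = 2, 3, 5, 7, 11, 13, 4
C = [[p, q, r], [x_, y, z]]
X = sweep_transform(2, j=1, h=2, k=k)
C2 = matmul(C, X)
expected = [[p, p + q, r + q * k], [x_, x_ + y, z + y * k]]
```

Here `X` comes out as the 3x3 matrix shown above and `C2` equals `expected`, i.e. the subscript matrix of the second transformed node.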
4.5  Array Slicing


When an instruction node represents a larger
operation than the processor system can handle, the array
operands in the node have to be sliced. Let us define the
required number of slices as S. The slicing utilization,
$U_s$, discussed in Section 4.3, is defined as the percentage
of the amount of a resource that is being utilized. These
are the two most important quantities to be discussed in
this section.

When the upper index bounds are all independent
(i.e. the first n columns of D are all 0's), it is easy to
find S and $U_s$:

\[
S = \lceil N/p \rceil \prod_{\substack{i=1 \\ i \ne s}}^{n} \left[ D(i,n+1) + 1 \right]
\]

\[
U_s = \frac{N}{\lceil N/p \rceil \, p} \times 100\%
\]

where $I_s$ is the sweeping index, N = D(s,n+1)+1, and p is the
number of processors.

After transforming the loop as discussed in Section
4.4, no upper index bound will be dependent on $I_s$. If,
however, the upper bound of $I_s$ depends on index h, we have
to calculate S and $U_s$ differently. If we have this kind of
instruction node:

      DO h=0,N-1
        DO I_s=0,ah+b
          instruction

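a rough estimate of S can be calculated as follows: the
average upper bound of $I_s$ is aN/2+b. Hence

\[
S = \lceil (aN/2+b)/p \rceil \, N
\]

and, since the node contains about N(aN/2+b) elements in total,

\[
U_s = \frac{N(aN/2+b)}{S\,p} \times 100\%
\]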

If a=1, we have an upper triangular system and we
can find more accurate values for S and $U_s$. Let us first
consider S for a purely triangular system, as shown below:

[Figure: a purely triangular system, labeled N.]

We break it into $\lceil N/p \rceil$ sets of columns. The first
set will contribute N to S. The second set will contribute
(N-p) to S, and so on. So

\[
S = \sum_{i=0}^{\lceil N/p \rceil - 1} (N - pi)
  = \lceil N/p \rceil \left( N - p(\lceil N/p \rceil - 1)/2 \right).
\]


Now let us return to the original triangular system,
as shown below:

[Figure: the original triangular system, labeled N.]

where $M = N + b - \lceil b/p \rceil p = N - ((p-b) \bmod p)$.

The first half contributes $N \lceil b/p \rceil$ to S. The second
half is a purely triangular system with size MxM, and thus
contributes $\lceil M/p \rceil \{ M - p(\lceil M/p \rceil - 1)/2 \}$ to S. Therefore

\[
S = N \lceil b/p \rceil + \lceil M/p \rceil \{ M - p(\lceil M/p \rceil - 1)/2 \}.
\]

The total number of elements = N(N+1+2b)/2. Hence

\[
U_s = \frac{N(N+1+2b)}{2Sp} \times 100\%.
\]
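
The closed form for S can be compared against a direct count, in which row h of the triangular node (h = 0, ..., N-1) holds h+b+1 elements and therefore needs ceil((h+b+1)/p) slices. This is a sketch under that reading of the loop bounds; the function names are ours.

```python
def ceil_div(a, b):
    """Ceiling of a/b for integers, b > 0 (also correct for negative a)."""
    return -(-a // b)

def slices_triangular(n, b, p):
    """S = N*ceil(b/p) + ceil(M/p)*(M - p*(ceil(M/p)-1)/2)."""
    m = n + b - ceil_div(b, p) * p
    sets = ceil_div(m, p)
    # sets*(sets-1) is always even, so the integer division is exact.
    return n * ceil_div(b, p) + sets * m - p * sets * (sets - 1) // 2

def slices_brute_force(n, b, p):
    """Count slices row by row: row h has h+b+1 elements."""
    return sum(ceil_div(h + b + 1, p) for h in range(n))

def slicing_utilization(n, b, p):
    """U_s = N(N+1+2b) / (2*S*p), as a fraction."""
    return n * (n + 1 + 2 * b) / (2 * slices_triangular(n, b, p) * p)

s_example = slices_triangular(4, 1, 2)     # rows of 2, 3, 4 and 5 elements
u_example = slicing_utilization(4, 1, 2)
```

For N=4, b=1 and p=2 processors the rows need 1+2+2+3 = 8 slices, which is what `slices_triangular` returns, and 14 elements in 16 processor slots give a utilization of 0.875.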
When S is greater than one, we will have to
replicate the resource requests S times. However, in order
to save simulation time, we will devise the following
procedure.


We first observe that in general, each slice will
follow the same general pattern: a fetch, an alignment, a
processor cycle, then another alignment and finally a store.
When there are more slices, they will be duplicates of the
sequences, but with slight time displacements, like:

[Figure: the sequence F, A, P, A, S repeated for successive
slices, each displaced slightly in time.]

The middle part of the operation will be the
concurrent operations of F-A-P-A-S, each working on a
different slice. It is repeated (S-4) times. To expedite
simulation, we put an implied DO loop around this middle
part. This DO loop will be simulated repeatedly until no
other request node is using any system resource. Then one
more iteration will be done to figure out the iteration
time. This time is then multiplied by the number of
remaining iterations to find the total time. This method
will reduce the amount of parallelism slightly; however, it
reduces the simulation time greatly.
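
The count of (S-4) repetitions can be illustrated with a toy schedule. The sketch assumes, purely for illustration, that each of the five stages takes one time step and that slice i starts at step i; the simulator's actual timings are not modeled.

```python
STAGES = ["F", "A", "P", "A", "S"]   # fetch, align, process, align, store

def active_stages(step, n_slices):
    """(slice, stage) pairs that are busy at a given time step."""
    return [(i, STAGES[step - i])
            for i in range(n_slices)
            if 0 <= step - i < len(STAGES)]

def full_overlap_steps(n_slices):
    """Number of steps in which all five stages work on different slices."""
    total_steps = n_slices + len(STAGES) - 1
    return sum(1 for t in range(total_steps)
               if len(active_stages(t, n_slices)) == len(STAGES))

middle = full_overlap_steps(10)
```

With S = 10 slices the schedule has 14 steps, of which `middle` = 6 = S-4 have all five stages busy; the remaining steps are the fill and drain of the sequence.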


Only one DO loop can be active at any time for any
one level specified, in order for the above simulation
method to work. This can be achieved by generating a
resource request (for that level) at the DO node. The
resource will be released at the END node, when the required
iterations have been finished. This effectively locks out
any other independent DO loop activity which would interfere
with timing the "last" iteration. After the resource is
released, another DO can be activated by being granted that
resource.
4.6  Resource Time Calculation

After we have found S and various utilizations, we
will proceed to find the resource times of various needed
resources for that particular instruction node. Scalar
memory and processor times are simple to calculate and we
will not elaborate on these. However, recurrence and
ordinary vector operations need further explanation. Shift
networks present a different set of calculations and will be
treated in a separate section.

4.6.1  Recurrence Handling


For recurrence nodes, we have analyzed in Chapter 3
the conditions under which certain recurrence solving
algorithms should be used. We have also found the
corresponding resource times for each algorithm. Hence, we
can save a lot of simulation time by simply putting in the
corresponding resource times when a recurrence node is
encountered. This way, we assume that once a recurrence node
is encountered, we will preempt the machine to do just the
recurrence. Usually there is overlap between various
resource times, i.e., the sum of all the resource times
should be greater than the total execution time. To account
for this effect, we set the total execution time to a
constant parameter, OVERLAP, multiplied by the sum of the
resource times. This OVERLAP can be found by first writing
the recurrence solving algorithm in Fortran and then running
it through the Analyzer/Simulator. The average OVERLAP is
calculated for various array sizes and machine
configurations. This method will not give us the true value
for the execution time of each recurrence calculation.
However, it will give us a moderately reliable estimate.
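
A minimal sketch of this estimate follows. The calibration figures are hypothetical numbers invented for the illustration, not measurements from the Analyzer/Simulator, and the function names are ours.

```python
def calibrate_overlap(runs):
    """Average ratio of total execution time to summed resource times."""
    ratios = [total / sum(resource_times) for total, resource_times in runs]
    return sum(ratios) / len(ratios)

def recurrence_time(resource_times, overlap):
    """Estimated execution time of a recurrence node."""
    return overlap * sum(resource_times)

# Hypothetical calibration runs:
# (total execution time, [memory, alignment, processor] resource times)
runs = [(120, [60, 40, 50]), (230, [110, 80, 100])]
overlap = calibrate_overlap(runs)
estimate = recurrence_time([30, 20, 25], overlap)
```

Because the resource times overlap, the calibrated `overlap` lies below 1, and `estimate` is correspondingly smaller than the plain sum of the three resource times.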
4.6.2  Vector Operations

(using crossbar or omega network)


For a vector operation node using crossbar or omega
networks, the resource times are easy to calculate after the
order vectors (described in Section 4.2.1) for the operands
are calculated. For memory accesses, if the order for that
particular operand on a particular sweep is p, then
consecutive elements can be found p memory modules apart.
Hence we need g=gcd(p,M) memory fetches before we can fetch
the entire slice. In general, the number of memory cycles
required to access a p-ordered N-vector (defined in Section
4.2.1) stored in M memories = $\lceil N g/M \rceil$. After each memory
cycle, the time required for aligning a p-ordered vector
using a crossbar and for aligning before storing into a
p-ordered vector using an omega network is equal to 1 network
cycle. Nevertheless, to fetch a p-ordered vector slice using
the omega network, we also need g network cycles. Hence the
corresponding total numbers of network cycles are:

1)  crossbar:  $\lceil N g/M \rceil$.

2)  omega:     $g \lceil N g/M \rceil$   for fetching,

               $\lceil N g/M \rceil$     for storing.
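
These counts can be written down directly. The following is a sketch of the formulas only; the function and argument names are our own choices, and the network names are passed as plain strings for illustration.

```python
from math import gcd

def memory_cycles(n_elements, order, n_memories):
    """ceil(N*g/M) memory cycles for a p-ordered N-vector, g = gcd(p, M)."""
    g = gcd(order, n_memories)
    return -(-n_elements * g // n_memories)   # integer ceiling

def network_cycles(n_elements, order, n_memories, network, access):
    """Alignment network cycles for a fetch or a store."""
    g = gcd(order, n_memories)
    cycles = memory_cycles(n_elements, order, n_memories)
    if network == "crossbar":
        return cycles
    if network == "omega":
        return g * cycles if access == "fetch" else cycles
    raise ValueError("unknown network: " + network)

# A 4-ordered 16-vector: 16 memories (g = 4) versus 17 memories (g = 1).
conflict = memory_cycles(16, 4, 16)
prime = memory_cycles(16, 4, 17)
omega_fetch = network_cycles(16, 4, 16, "omega", "fetch")
```

With 16 memories the 4-ordered vector needs 4 memory cycles and 16 omega network cycles to fetch, while with 17 memories a single memory cycle suffices, which illustrates why a prime number of memories is among the configurations studied.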
4.6.3  Illiac Type Shift Networks

For processing systems with N processors, if the
interprocessor connections are $\pm\sqrt{N}$ and $\pm 1$ shifts, we call
the alignment network the Illiac type shift network.

For this type of network, when a uniform shift
permutation is requested, the resource time is easy to
calculate. Let s be the shift distance required. We first
set ss=min(s,N-s), where N is the number of processors.
Also let n be $\sqrt{N}$. The shift time required is

\[
\begin{cases}
\lfloor ss/n \rfloor + (ss \bmod n) & \text{for } (ss \bmod n) \le n/2, \\
\lfloor ss/n \rfloor + n + 1 - (ss \bmod n) & \text{for } (ss \bmod n) > n/2.
\end{cases}
\tag{4.1}
\]
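
The shift time rule can be sketched as follows, assuming N is a perfect square. We assign the tie (ss mod n equal to n/2) to the first case, which is the choice that agrees with the shift times for N=16 processors tabulated later in this section. The function name is ours.

```python
from math import isqrt

def shift_time(s, n_processors):
    """Shift time for distance s using +-1 and +-sqrt(N) shifts."""
    n = isqrt(n_processors)            # n = sqrt(N), N a perfect square
    ss = min(s, n_processors - s)      # shift the shorter way round
    z, y = divmod(ss, n)               # z shifts of n, remainder y
    if 2 * y <= n:
        return z + y                   # y unit shifts forward
    return z + n + 1 - y               # one extra n-shift, then back

times_16 = [shift_time(s, 16) for s in range(16)]
```

For N=16 this gives the shift times 0 1 2 2 1 2 3 3 2 3 3 2 1 2 2 1 for distances 0 through 15.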

When a triangular type array is accessed, and the
shift distance for each slice is different, we have to find
the average shift time for the array. Let A(x) be the
average shift time for shift distances 1,2,...,x. If x=Nk+w,
where N is the number of processors, then

\[
A(x) = \frac{k\,N D(N) + w D(w)}{Nk + w}
\tag{4.2}
\]

where D(w) is the average shift time for shift distances
1,2,...,w (w<N).
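
The idea of splitting x into k full periods of N distances plus a remainder w can be checked by brute force. The sketch assumes that a shift of distance s is equivalent to a shift of s mod N (shifts are cyclic in the N processors) and that N is a perfect square; the totals ND(N) and wD(w) are obtained here by direct summation rather than from closed forms.

```python
from math import isqrt

def shift_time(s, n_processors):
    """Shift time for distance s on an Illiac type shift network."""
    n = isqrt(n_processors)
    s %= n_processors                  # assumption: shifts are cyclic in N
    ss = min(s, n_processors - s)
    z, y = divmod(ss, n)
    return z + y if 2 * y <= n else z + n + 1 - y

def total_shift_time(w, n_processors):
    """w*D(w): summed shift time for distances 1, 2, ..., w."""
    return sum(shift_time(s, n_processors) for s in range(1, w + 1))

def average_shift_time(x, n_processors):
    """A(x) from k full periods of N distances plus a remainder w."""
    k, w = divmod(x, n_processors)
    full = total_shift_time(n_processors, n_processors)   # N*D(N)
    return (k * full + total_shift_time(w, n_processors)) / x

by_periods = average_shift_time(37, 16)
direct = total_shift_time(37, 16) / 37
```

For N=16 and x=37 (k=2, w=5) both routes give 68/37, since each full period contributes 30 and the remainder contributes 8.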

We will first calculate wD(w).


Let d=D(n). We first observe the regular pattern of
the shift time for s=1,2,...,n. For n=8,

      shift dist.   0  1  2  3  4  5  6  7  8
      shift time    0  1  2  3  4  4  3  2  1

Summing this arithmetic series, we will get

\[
nd = 2\,(n/2)(n/2+1)/2 = N/4 + n/2 \quad \text{for } n \text{ even,}
\]

and

\[
nd = 2\,((n+1)/2)((n+1)/2+1)/2 - (n+1)/2 = (n+1)^2/4 \quad \text{for } n \text{ odd.}
\tag{4.3}
\]

To observe the shift time pattern, we will show the
shift time for various s using N=16 processors.

      shift dist.   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
      shift time    0  1  2  2  1  2  3  3  2  3  3  2  1  2  2  1

Let us first concentrate on the shift time after we
arrive at the required n-partition.

We let $z = \lfloor w/n \rfloor$ and y = w mod n. z will be the
n-partition number while y is the number within a particular
n-partition. We set y=y+1 for z > n/2 to account for the
extra shift to reach the other end of the n-partition.


\[
yD(y) = y(y+1)/2 \quad \text{for } 0 \le y \le n/2,
\]

\[
yD(y) = (n/2)(n/2+1)/2 + \sum_{i=n/2+1}^{y} (n+1-i)
      = ny - y(y-1)/2 - N/4 \quad \text{for } y > n/2.
\tag{4.4}
\]


We then calculate the times required to get to the
respective n-partitions. They are:

$t_i$ :  0,1,...,(n/2-1),(n/2-1),(n/2-2),...,2,1.

Let wf be the total number of n-shifts we have to
do for all s in the first z n-partitions.

\[
wf = n \sum_{i=1}^{z} t_i = nz(z-1)/2 \quad \text{for } 0 \le z \le n/2,
\]

\[
wf = n(n/2)(n/2-1)/2 + n \sum_{i=n/2+1}^{z} (n-i)
   = Nz - nz^2/2 - nz/2 - nN/4 \quad \text{for } n/2 < z \le n.
\tag{4.5}
\]


The total number of n-shifts required for s in the
n-partition where w is located is equal to q, where

\[
q = \min(z, n-1-z) \cdot y \quad \text{for } z < n,
\qquad
q = 0 \quad \text{for } z = n.
\tag{4.6}
\]

Hence wD(w) = znd + yD(y) + wf + q.

For w > N/2, we need to subtract n/2 from wD(w), since we
have overcounted the shift time at N/2,

\[
wD(w) =
\begin{cases}
znd + yD(y) + wf + q & \text{for } w \le N/2, \\
znd + yD(y) + wf + q - n/2 & \text{for } w > N/2.
\end{cases}
\tag{4.7}
\]


Now we want to find D(N). For w=N, z=n and y=0+1=1.
Hence

\[
ND(N) = n(N/4+n/2) + n - 1/2 + 1/2 - N/4 + nN - nN/2 - N/2 - nN/4 + 0 - n/2
      = nN/2 - N/4 + n/2.
\tag{4.8}
\]

Now A(x) in (4.2) can be found easily once ND(N) and wD(w)
are known.


If a random shift is required, then an average time
of $\sqrt{N}/2$ will be used. If a broadcast function is needed,
then we will use the worst case result of $\sqrt{N}$.


When the permutation is other than shift or
broadcast, we will apply Orcutt's result of $8(\sqrt{N}-1)$ for
omega passable permutations [27].
4.7  Experimental Results


Our initial experiments will deal with the effects
of the following architectural parameters:

1)  The number of array processors, and the speed of the
    processors relative to the array memory system.
    Initially, the processors will be restricted to a
    single group of processors operating from a single
    instruction stream (SIMD).

2)  The presence or absence of an independent scalar
    processor and/or memory. The absence of a scalar
    processor forces scalar operations to be performed by
    the array processors.

3)  The memory system, including the array storage scheme
    (1-skew, etc.), and the number of memories (power of two
    or prime).

4)  The type of alignment network:
    a)  crossbar,
    b)  omega network,
    c)  $\pm 1$, $\pm\sqrt{p}$ shifter (Illiac IV).



These parameters will be studied for a large variety
of application programs, and in addition the size of the
application programs (i.e. the array sizes) will be varied
in order to produce families of performance figures.

The tables below present some preliminary results of
experiments on three programs. We would like to stress at
this point that these results are preliminary. The three
programs can hardly be construed as representative of any
large population of applications. The first program, ADVV,
is a 4-point relaxation scheme. ADVV was chosen because of
its highly parallel nature. The second program, ELMBAK,
forms the eigenvectors of a real matrix by back transforming
those of the corresponding upper Hessenberg matrix. ELMBAK
is reasonably complicated, but has no recurrences. The
third program, SLEQ1, is a Gauss-Jordan reduction program.
SLEQ1 was chosen because it contains a recurrence relation
(an R<N,1> system). We present the results of these three
programs only as an indication of the types of results we
expect from our experiments, and an illustration of how to
interpret the results.

The complete tables of experimental results for
these three programs are shown in Appendix E. Some of the
more interesting figures are grouped together in Tables 4.1
through 4.4.


Table 4.1 shows the speed factor, $F_p = T_1/T_p$, and
processor utilization $U_T$ using 16 processors, 17 memories, a
crossbar alignment network, skewed storage, and separate
scalar processor and memory. The results are presented as a
function of N, the data array sizes. Notice that for ADVV, the
speed factor quickly approaches the maximum value of 16.
Processor utilization ranges from 43% to 71%. The result
for N=16 indicates that $U_a$ = 70% ($U_s$ = 100% since N=p=16, and
for ADVV, $U_{IF}$ = 100%). Thus, the processors are only busy 70%
of the time due to non-perfect overlap of array processor
operations with alignment, memory, and scalar operations.
However, the speed factor is 16, which would indicate a
similar degree of non-perfect overlap in a comparable serial
processor. The other programs, ELMBAK and SLEQ1, indicate
much lower speed factors and utilizations. SLEQ1 contains
recurrences, which are handled in parallel but much less
efficiently than the pure vector operations in ADVV.
Notice, however, that even though SLEQ1 contains a
recurrence, the speed factor of 14.5 is very close to the
maximum of 16 when N is 60. We believe it is significant
that we are able to handle recurrences this well.


The reason the ELMBAK results are so low illustrates
an interesting situation. At the present time programs are
compiled into three address vector or scalar instructions.
If the vectors are of sufficient length, then an implicit
loop is established in order to cycle the processors,
memories, etc. a sufficient number of times. Within this
implicit loop there is usually overlap between processor,
alignment and memory operations. However, between separate
vector instructions, there is no overlap. Thus, one
instruction must finish before the next starts. This is
what causes the low figures for ELMBAK. This indicates to
us that it is important to design the vector instructions
and control unit so that different vector instructions
overlap each other.


It is also interesting to note that $F_p$ and $U_T$
continue to increase with N for both ELMBAK and SLEQ1. This
is due to increased overlap of operations within the implied
loops of vector instructions and, in SLEQ1, more efficient
recurrence algorithms which are used when N is sufficiently
larger than the number of processors.

Table 4.2 indicates the effectiveness of various
alignment networks and skewing schemes. As we can see, the
crossbar and omega networks performed equally well. The
Illiac network performed somewhat better, at least for ADVV
and ELMBAK. This is due to two facts. First, the Illiac
network was set to operate four times faster than the other
networks. This reflects the difference in the complexity of
the networks. Second, we were able to "compile" the
programs using very simple alignment requirements which
could be easily handled by all three networks. The lack of
difference between straight storage (0,1) and skewed storage
(1,1) is also a reflection of this second point. We were
able to compile the programs so they only needed access to
rows, and thus they do not benefit from skewed storage.
However, we do not believe this result will hold for larger,
more complicated programs.

            Crossbar                  Omega                     Illiac IV
Program     Straight     Skewed       Straight     Skewed       Straight     Skewed

ADVV        1416         1416         1416         1416         1384         1384
            (9.7)        (9.7)        (9.7)        (9.7)        (9.9)        (9.9)

ELMBAK      [illegible]  1760         1760         1760         1282         1261
                         (2.0)        (2.0)        (2.0)        (2.8)        (2.8)

SLEQ1       [illegible]  2644         2644         2644         [illegible]  [illegible]
                         (4.5)        (4.5)        (4.5)

Table 4.2  The effectiveness of various alignment networks and
skewing schemes.


Table 4.3 illustrates another interesting result.
One question which continually plagues machine designers
concerns the relative speed of the memory and processor.
Should the memory be the same speed as the processor, twice
as fast, or three times as fast? The answer depends on many
things: the design of the machine instructions, the size of
arithmetic expressions in the source program, etc. Table
4.3 shows the execution time, $T_p$, and processor utilization,
$U_T$, for three different cases. In column 1, the processor
array, alignment network, and memory all have the same cycle
time. In column 2, the alignment network alone has been
made twice as fast. There is very little difference between
columns 1 and 2. This is because the faster crossbar switch
is only effective when data alignments are required in the
absence of memory accesses. None of these three programs
required such alignment. The small difference present
between columns 1 and 2 simply represents a shorter overall
time for a "short" vector operation in the absence of
inter-instruction overlap.

Column 3 of Table 4.3 corresponds to a machine whose
alignment network and memories are twice as fast as the
processor array. For ADVV, the improvement in $T_p$ is
noticeable but not significant. This is because ADVV has
relatively large expressions in the source program, so the
ratio of memory to processor operations is close to 1:1.
Thus, ADVV does not need a very fast memory. For ELMBAK and
SLEQ1, however, the improvement in $T_p$ is more significant.
This would indicate that, at least for these programs, the
faster memory might be cost effective.

Table 4.4 shows the effectiveness of an independent
scalar processor and scalar memory on $T_p$. Also included in
the table are the utilizations of scalar memory and scalar
processor respectively. A scalar processor and memory
should be effective for several reasons. First, without a
scalar memory, when a scalar is being broadcast over all
elements of an array, the scalar operand would have to be
fetched from the array memory and aligned (broadcast). This
constitutes wasteful use of the array memory. Second, the
use of both scalar memory and processor would allow some
scalar operations to be done simultaneously with array
operations. Thus we would be able to overlap or mask out
certain truculent serial operations in the program.


In Table 4.4 we can see that the scalar processor
causes no improvement in $T_p$ and the scalar memory results in
only marginal improvement, even though both are utilized to
some extent. However, we believe that our "compiler" can be
improved so as to utilize the scalar hardware more
effectively. This will involve improving the
inter-instruction overlap and more accurate accounting for
such things as subscript calculation.

Program     Col 1         Col 2         Col 3

ADVV        1416          1384          1036
            43%           44%           58%

ELMBAK      1760          1650          1023
            6%            7%            10%

SLEQ1       2644          2590          1606
            14%           14%           23%

Table 4.3  The effects of the relative differences in memory,
alignment, and processor cycle times on $T_p$ and $U_T$. (16
processors, crossbar switch, N=10, and skewed storage.)


                          Scalar        Scalar        Both scalar
                          memory        processor     processor
Program     Neither       only          only          and memory

ADVV        1544          1416          1544          1416
            -, -          12%, -        -, 3%         12%, 3%

ELMBAK      1854          1760          1854          1760
            -, -          12%, -        -, 12%        12%, 12%

SLEQ1       2768          2644          2768          2644
            -, -          8%, -         -, 9%         8%, 9%

Table 4.4  The effect on execution time, $T_p$, of a scalar memory
and scalar processor. The percentage figures are scalar memory
utilization and scalar processor utilization respectively.


The above discussion is based on only a limited amount
of experimental results. It can only be regarded as an
illustration of how to interpret the results. A user can
conduct experiments using whichever set of "benchmark"
programs he is interested in. In the end,
he will then be able to determine the kind of machine
configuration that is most suitable for him.
5.   CONCLUSION


This thesis concerns the utilization and
effectiveness of interprocessor connection networks for
parallel (SIMD) type computers. The problems concerning
interconnection networks can be divided into three areas:
capabilities, exploitation, and effectiveness.


Capabilities include network properties and network
control methods. One of the networks that we have examined
closely is the omega network. The omega network is one of
the more attractive multistage networks; it is moderate in
network complexity and quite powerful in its permutation
capabilities. If we concentrate on only some of the more
common permutations, we can further reduce the complexity of
its control algorithms. Three different control methods are
shown in this thesis. We have discussed a significant new
property of the omega network, the partitioning property.
We showed that a large size omega network can be regarded as
a conglomeration of many smaller size omega networks, each
passing a different smaller omega-passable connection
function. This partitioning property of the omega network
proves to be vital for the efficient handling of many
computation algorithms. We also discussed another important
property of the omega network, the broadcasting ability. We
showed the conditions under which a 2-dimensional data array
can be broadcast to a 3-dimensional data array using the
omega network. This data transfer ability is necessary, for
example, in certain matrix multiplication and recurrence
solving algorithms. In Chapter 2, we were also able to
extend the capabilities of the omega network further using
the concept of linear permutations.

The shuffle connection is the basis of many
interconnection networks, like the omega network, the
Batcher network, and the binary Benes network. Because of
this similarity, we can apply some of the properties of one
network to the others, and hence further increase our
understanding of such networks. Some such extensions are
shown in Sections 2.3 and 2.4.


Because of the great simplicity in gate counts, the
one stage perfect shuffle exchange network is also carefully
examined. Algorithms were presented in Section 2.5 to show
how such a network can be used for performing permutations.

With the old and new knowledge we have acquired on
network capabilities, we would like to apply them to some
common computations. Recurrence solving algorithms and
matrix multiplication algorithms are two examples that we
used in Chapter 3. The efficient handling of recurrence
operations is essential in parallel processing systems
because the parallel system will be degraded to a serial
machine otherwise. With careful planning, we were able to
simplify the alignment requirements of various recurrence
solving algorithms. So, instead of a full crossbar, we can
now use a simple alignment network, such as the omega
network. In Section 3.3, we show how a common computation
algorithm (such as matrix multiplication), if detected, can
be adapted onto a parallel processing system equipped with
only a one stage network. Hence Chapter 3 has been
dedicated to techniques for exploiting various
interconnection networks.

To evaluate the true effectiveness of a particular
interconnection network, we have to determine the
effectiveness on real programs of a parallel processing
system equipped with such a network. This can be achieved
with the help of the Analyzer/Simulator project currently
being developed. The program analyzer first generates a
highly parallelized version of the program. Then the RRG
compiles it into suitable pseudo machine code from the
information about the parallel processing system that the
user defines. This pseudo compilation is done based on the
capabilities of the architecture to be studied, including
the type of interconnection network used. From the results
of the simulator, we can determine how well an
interconnection network works.

With the methodology described in this thesis, the
true effectiveness of an interprocessor connection can then
be determined.

We conclude this thesis by giving the following
topics that are worthwhile for further research:

1)  Is there a set of basic permutation patterns for the
omega ROM control method from which other useful
permutation patterns can be generated by doing logical
operations on some members of the set? For example, how
can we generate the control pattern of a k-shifted
p-ordered spread permutation from that of a k-shift
permutation and that of a p-ordered permutation?

2)  Finding  the  analytic  bounds  and  averages  for  the  time 
required  to  pass  a  permutation  using  a  one  stage  perfect 
shuffle  network  is  a  worthwhile  project. 

3)  We are able to adapt recurrence solving algorithms to a
parallel processing system equipped with a (log n)-stage
network. A highly significant result would be to show
how recurrence algorithms could be handled on a one-stage
network (perhaps with coupled shuffle and shift connections).
This would lead to a very cheap and effective parallel
architecture.

4)  Control unit times should be carefully added to the
Simulator. Subscript calculations and register and
scalar usages should be accounted for more accurately.

APPENDIX A


We will first show a multiplication algorithm that
takes N processors and 2N memories. The skewing scheme is
(sqrt(N)+1)-skew 2-skip. Memories 2i and (2i+N+sqrt(N)+1) mod (2N) are
connected to processor i. All even RA's refer to memory 2i
and all odd RA's refer to memory (2i+N+sqrt(N)+1) mod (2N). An
illustration of the memory map is shown in Figure A.1.

Figure A.1  (sqrt(N)+1)-Skew 2-Skip Scheme

Algorithm:

A.  Repeat for IC=0 to N-1;

    1.  Fetch A, RA=IC, into R1.
    2.  Set k=[(sqrt(N)+1)*IC] mod (2N).
    3.  If k is odd then k=(k-N-sqrt(N)-1) mod (2N).
    4.  k=k/2.
    5.  IR=(PPN-k) mod N.
    6.  Repeat for IT=0 to N-1;
        a.  fetch B, RA=IR, into R2.
        b.  multiply R1 and R2 into R3.
        c.  G-permute IR.
        d.  J=(PPN-(sqrt(N)+1) IR/2) mod N.
        e.  if IR is odd then J=(J+N/2) mod N.
        f.  store R3 into T, RA=J.
        g.  G-permute R1.
    7.  Set R1=0.
    8.  Repeat for IT=0 to N-1;
        a.  fetch T, RA=IR, into R2.
        b.  add R1 and R2 into R1.
        c.  G-permute R1.
        d.  G-permute IR.
    9.  Store R1 into C, RA=IC.

B.  Done.
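
The memory wiring and steps 2 to 4 of the algorithm can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names (memories_of_processor, owner_of_memory, start_processor) and the skew parameter are hypothetical, and the sketch assumes that N is a perfect square, so that N + sqrt(N) + 1 is odd and odd-numbered memories map back to even ones.

```python
import math

def memories_of_processor(i, n, skew):
    """Two memories wired to processor i: 2i and (2i + N + skew) mod 2N."""
    return (2 * i) % (2 * n), (2 * i + n + skew) % (2 * n)

def owner_of_memory(m, n, skew):
    """Steps 3 and 4: find the processor connected to memory m.

    An odd memory index is first mapped back to the even index 2i of
    the same processor, then halved to give i.
    Assumes n + skew is odd (hypothetical precondition of this sketch).
    """
    if m % 2 == 1:
        m = (m - n - skew) % (2 * n)
    return m // 2

def start_processor(ic, n, skew):
    """Steps 2 to 4: value of k obtained for index IC."""
    k = (skew * ic) % (2 * n)          # step 2
    return owner_of_memory(k, n, skew)  # steps 3 and 4

if __name__ == "__main__":
    n = 16
    skew = math.isqrt(n) + 1  # (sqrt(N)+1)-skew
    for ic in range(4):
        print(ic, start_processor(ic, n, skew))
```

With N = 16 the skew is 5, so processor 0 is wired to memories 0 and 21, and every one of the 2N memories is reached by exactly one processor.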


Another algorithm also uses 2N memories, except now
it uses a 1-skew 2-skip storage scheme.


For this scheme, memories 2i and (2i+N+1) mod (2N) are
connected to processor i. The memory scheme is illustrated
in Figure A.2.

Algorithm:

A.  Repeat for IC=0 to N-1;

    1.  Fetch A, RA=IC, into R1.
    2.  If IC is odd then k=(IC-N-1) mod (2N); otherwise k=IC.
    3.  k=k/2.
    4.  IR=(PPN-k) mod N.
    5.  Repeat for IT=0 to N-1;
        a.  fetch B, RA=IR, into R2.
        b.  multiply R1 and R2 into R3.
        c.  G-permute IR.
        d.  J=(PPN-IR/2) mod N.
        e.  if IR is odd then J=(J+N/2) mod N.
        f.  store R3 into T, RA=J.
        g.  G-permute R1.
    6.  Set R1=0.
    7.  Repeat for IT=0 to N-1;
        a.  fetch T, RA=IR, into R2.
        b.  add R1 and R2 into R1.
        c.  G-permute R1.
        d.  G-permute IR.
    8.  Store R1 into C, RA=IC.

B.  Done.

The following twelve tables are the experimental
results of three Fortran programs: ADVV, ELMBAK, and SLEQI.


The column AN shows which alignment network is chosen.
The column (M=) indicates whether the number of memories is
prime or not. The column SKEW shows the skewing scheme
chosen. Finally, the column SPEED (P/A/M) gives the relative
speeds of processor, alignment network and memory. As for
the headings, the number of processors is given as P=xxx.
SP/SM indicates whether the switches SP and SM (cf. Section
4.2.2) are turned on or not.

For the table on Execution Time (Tp) and Speed
Factor (Fp), the main number in each entry is Tp and the
number in parentheses is Fp. The remaining tables show
utilization measures of various system resources. They are
in the order: array memory, input alignment network, output
alignment network, vector processor system, scalar memory and
scalar processor.

When the entries in some row j are the same as those of
row i, row j is marked "same as row i".